I. INTRODUCTION

Diagnostic testing in education is undergoing a revolution. On
one hand, a fair number of specialized test protocols are extant which
are called "diagnostic," a large quantity of statistical and
psychometric theory can be applied to the diagnostic question, and
computer technology promises to deliver into the hands of the
classroom teacher systems which will teach, test, diagnose, and
remediate a variety of educational offerings. On the other hand,
diagnostic testing in most areas of education builds on weak
theoretical foundations, makes use of few statistical tools and none
of the wealth of experience available from diagnostic testing in other
professions, and with rare exceptions does not yet draw on the power
of computers.




This paper examines the history of approaches to diagnosis in
education, and in a profession with far more concentrated attention to
the conceptual and mathematical underpinnings of diagnosis, the field
of medicine. We present a comprehensive model of diagnostic testing
in education and a summary of the results of four studies, one from
each of four separate heuristics developed within the model. The
paper concludes with a discussion of the advantages, disadvantages,
and possible productive directions for educational diagnosis,
particularly in the realm of individualized adaptive diagnostic
testing administered by computer.

The phrase "diagnostic testing" has been used in education ever
since the first formal intelligence tests were devised. From the
beginning the nominal intent of educational diagnosticians appears to
have been relatively stable: "...the taking of certain symptoms that
exist and finding out from them what the trouble is" (Kallom, 1919,
p. 11). By contrast, the diagnoses themselves, the process by which
diagnosis is reached, and the management decisions which follow have
undergone numerous and extensive revisions, and whether to build on a
disease model or some alternative such as learning theory has been a
constant source of controversy. Except in reference to specialized
psycho-educational and physical handicaps, however, there is not a
great deal to show for the effort (Tyler & White, 1979).

The common thread behind most approaches to educational diagnosis
in the past seven decades has been the use of tests to provide
specific information about the difficulties of an individual student
which will point to some appropriate remedial treatment. The phrase
"diagnostic testing" is increasingly being used for the assessment of
learning difficulties within the classroom. Existing tests which use
the diagnostic label diverge widely in how they arrive at a diagnosis
of individual patterns, yet all are concerned to some degree with the
following key elements:

a) examination of patterns of performance and achievement of an
individual student,

b) construction of a summary profile of strengths and
weaknesses,

c) identification of the specific misunderstandings,
misconceptions, and misinformation that lead the individual
student to perform poorly.

Viewed in this manner, diagnosis of difficulties experienced by an
individual student could lead to appropriate management strategies for
further learning, remediation, re-education, or referral.

The earliest efforts at developing diagnostic strategies in
education were predicated on a very similar rationale. Uhl's (1917)
diagnostic method emphasized close examination of each pupil's methods
of work and questioning of students aloud while they solved a
problem. Uhl developed a series of hypotheses concerning students'
incorrect methods and recommended drilling pupils in methods which are
"more effective" than those they already employ. Anderson (1918)
discussed diagnostic testing in reference to seven types of errors in
long division. Subjects were given individual oral tests in which
they were asked to think aloud and to say what they were thinking and
doing while solving the problem. Anderson's aim was to enable
teachers to become diagnosticians of "mathematical diseases." Paulu's
(1924) Diagnostic Testing and Remedial Teaching gave numerous examples
of tests in spelling, writing, reading, arithmetic, geography, and
history which had diagnostic potential. Paulu urged that teachers
observe their students' working procedures and learn to recognize
individual difficulties. The number of times each problem was
incorrectly solved, the body movements made by the child while
working, and the use of finger-counting were a few of the examples
viewed as important signs of difficulty to be followed by specific
individual remediation.

The first volume of Journal of Educational Research contained a
study of diagnosis of error types (Willing, 1920); the first volume of
Journal of General Psychology contained a major article by Spearman
(1928) on the "Origin of error"; the second volume of the British
Journal of Educational Psychology presented a lengthy analysis of
theories of cognitive error (Fortes, 1932). In general, errors tend
to show themselves as matters of either principle (such as faulty
reasoning, misunderstanding, or inability to apply a correct method or
strategy) or accuracy (such as errors in copying, manipulating
numbers, or misplacing parts of the problem).

While these historical documents present a minimum of
sophisticated conceptualization, the present status of many
application-oriented publications in educational diagnosis is not many
steps further. Despite the intention to make use of charts, graphs,
and profile analysis, how-to books like Smith (1969), articles like
Okey (1976), and computer programs like Furlong and Miller (1978), for
example, contain relatively few substantive advances in either the
specificity of diagnosis or the range of options available to teachers
in both developing and utilizing a given diagnostic test. Moreover,
two essential definitions often appear absent from diagnostic tests
and manuals. The first is the meaning of the word "pattern"; a number
of sources use this word but its meaning varies rather widely:

a) "pattern" as profile of total scores in a curricular domain
accumulated across a variety of tests administered throughout the
school year ("a pattern of deficient test scores in spelling");

b) "pattern" as profile of subscale scores assembled from a
single test administered once ("a pattern of misunderstanding of
two-digit arithmetic");

c) "pattern" as consistent behaviors across differing situations
("a pattern of hyperactivity");

d) "pattern" as unusual responses to a set of test items ("a
pattern of responses which points to carelessness on this test");

e) "pattern" as specific erroneous responses within a set of test
items ("a pattern of responses which demonstrate consistent
errors in logic").

The various writers do not appear to have thought that "pattern"
raises such a plethora of possibilities. The list is not exhaustive,
nor are the entries mutually exclusive, but often-deficient
recommendations for interpreting "patterns" hinge on the reader's
correct choice of definition.



The second word requiring definition, surprisingly, is
"diagnosis" itself. A number of educational writers refer to
"diagnosis" as if it is either self-explanatory or too trivial
to discuss, yet an adequate definition of the term is critical for
purposes of further refinement and application.
As effort is expended in developing and administering diagnostic tests
of increasing sophistication, an improved definition of diagnosis can
be developed by examining the range of present applications, the state
of theory concerning diagnosis, and contributions from the field of
medicine. The following sections address each in turn.

II. VARIETIES OF DIAGNOSTIC TESTING IN EDUCATION

1. Testing ability vs. testing achievement

The vast majority of tests in education today may be grouped into
one of two categories: (a) specific or general ability tests (e.g.,
intelligence tests) designed to measure a student's innate ability or
potential, and (b) achievement tests designed to measure how much a
student has learned.

In a sense, almost any standardized test of ability or
achievement may be regarded as diagnostic. But the practice of
educational testing has broken into distinct categories, of which the
major ones are placement and selection, grading and certification,
motivation and research, as well as diagnosis. The placement and
selection operations grew directly from the work of Alfred Binet and
the extensive use of objective intelligence tests during World War I
in the evaluation and placement of new recruits. Soon a wide variety
of intelligence and achievement tests were being made available to
employers and to technical and vocational training institutions for
the purposes of screening new applicants, and both achievement and
intelligence tests are in worldwide use today for placement.
Following the meaning of "pattern" as a profile of subscale scores,
placement and selection tests are "diagnostic" in the sense that a
pattern of test profiles may be used for differential assignments.

Objective tests are widely used to study various aggregate
aspects of the educational process. This category of use encompasses
measurements embedded in the design of educational experiments, the
evaluation of new educational programs or curricula, and the
monitoring of district, state, or national levels of achievement.
Achievement tests are used extensively to measure outcomes and hence
to "diagnose" the effectiveness of instructional programs, specific
school districts, or individual teachers. The use of objective tests
in the certification process at the end of a specific program of
education or training is seen primarily as a method of maintaining
standards over time, and such tests are often only crude diagnostic
indicators. Since changes in general ability or intelligence are
thought to be mostly beyond the scope of the educational system, the
use of ability tests in a research or evaluation setting is usually
not treated diagnostically, but rather as a way of controlling the
experimental design or of "explaining away" some of the observed
variance of achievement scores.

Diagnostic testing of the individual student is the category with
which this paper is primarily concerned. In the same way that the
psychologist has a wide range of diagnostic ability measures to aid
the identification of various sensory defects or brain dysfunctions,
now the teacher has access to "diagnostic" tests as well. While the
notion of diagnostic achievement testing has been around for several
decades, the appearance of large numbers of objective achievement
tests which purport to be diagnostic is a recent phenomenon. Clearly
a model of the diagnostic process which translates directly to the
classroom setting and the needs of the teacher to diagnose educational
problems would aid in understanding and utilizing the range of tests.
One model which begins to meet these needs is provided by Thomas
(1983) and is presented below.

It is useful to distinguish between testing for specific learning
disabilities and more general assessments of learning achievement.
Hennessy (1981) points out that "the primary use of individually
administered tests in schools today is to obtain descriptions of
functioning for the purpose of diagnosis of children thought to be
learning disabled, neurologically impaired, developmentally disabled,
or emotionally disturbed" (p. 42). Indeed, codes of practice in many
states require that individually administered abilities measures shall
be included as part of the diagnosis of children prior to their
classification or assignment to special educational programs.

A variety of different conditions are subsumed under the general title
specific learning disabilities, and there seems little doubt that many
of these conditions do result from, or are related to, particular
brain malfunction or damage. A variety of psychological tests have
been devised to assist their identification, though Arter and Jenkins
(1979) and Hennessy (1981) point out how limited the evidence of
validity is for these tests. However, such tests are generally the
prerogative of the trained clinical psychologist, and are not
customarily used by (and may not be legally available to) the
classroom teacher. But the classroom teacher's needs are not
identical: frequently the task is not one of locating disability or
disturbance but rather one of finding and understanding where a
student has encountered a block, is using an erroneous strategy, or
has been otherwise left by the wayside.

2. Testing and diagnosing individual educational performance 

Thomas (1983) distinguishes between diagnostic and other forms of
evaluation in terms of the sort of question each addresses and the
uses typically made of the evaluation data:

"With diagnostic evaluation, the question consists of two parts:
what is the pattern of strengths and weaknesses in the students'
achievement of the learning goals, and what causes underlie such a
pattern? Results of [...] treatment of a student's learning
weaknesses, either through remediation of underlying causes or
through helping the pupil learn more adequately despite the
causes." (p. 13)

Basic to this approach to diagnosis is the interpretation of the
pattern of performance scores. Here the use of the word "pattern" can
be interpreted both as a profile of subscale scores and as consistent
or unusual responses and behaviors. Although it is not essential,
often such patterns are derived by comparing an individual's
performance with that to be expected based on the results of some
reference group, an approach that may appropriately be described as
"norm-referenced."

Thomas recommends a methodical approach to the diagnostic use of
tests, whether by classroom teacher or school psychologist. He points
out the errors that can result from steps being omitted and short cuts
being taken. For example, a very poor reading performance as measured
on a general abilities test may stem from any one of a variety of
completely unrelated causes, and further investigation is necessary
before appropriate treatment can be confidently prescribed.

Thomas' approach to diagnostic assessment of students, shown in
Figure 1, is comprehensive although time consuming. It succeeds in
codifying what teachers are supposed to be doing when they provide
individualized instruction. The model is not limited to the
norm-referenced approach and may also be applied to
criterion-referenced testing, as will be discussed below.
Furthermore, it may succeed in identifying and diagnosing the causes
of major problems, although it is less likely to be sensitive to
specific misunderstandings, misconceptions, and misinformation which
may be significant to an individual student in his mastery of a given
topic.

Figure 1

The Thomas Diagnostic Evaluation Model

Stage 1: Status Assessment
Critical questions:
    1.1 What are the specific objectives the student is expected to
        have achieved?
    1.2 What assessment techniques can best determine how well the
        student has achieved those objectives?
    1.3 What pattern of discrepancies between expectations and
        performance is identified by these techniques?

Stage 2: Cause Estimation
Critical questions:
    2.1 What reasons for the deficiencies revealed in 1.3 need to be
        considered?
    2.2 How can these possibilities be evaluated?
    2.3 On the basis of these evaluations, what is the most likely
        cause (or combination of causes) for the pattern in 1.3?

Stage 3: Treatment
Critical questions:
    3.1 What treatments would help the student most effectively
        given 1.3 and 2.3?
    3.2 What evaluation techniques are available to determine how
        well the treatment is succeeding?
    3.3 As assessed by these techniques, how successful is the
        treatment?

(After Thomas [1981], p. 15-16)



An increasing number of commercially published standardized
achievement tests are now incorporating the "diagnostic" label into
their title. However, it would seem hard to justify the label for any
test that produces only a single score. Not only do such tests
provide no indications of the likely cause of a particular result and
no suggestions as to appropriate remedial treatment (as required by
Thomas' model), but the single score can be only a small part of the
data needed to build up the pattern on which normative diagnosis
rests. A reading comprehension test may indicate, with high
reliability and validity, that a sixth grade student is reading at
the fourth grade level, but the information needed for diagnosis of
the student's problems would not be found unless some detail such as
subscale scores or specific erroneous response patterns is also made
available. It would be more reasonable to reserve the term diagnostic
for batteries of standardized tests which yield fairly complete
profiles of performance in normative terms, the interpretation of
which might well suggest both causal factors and remedial treatments.

Such patterns of scores, or normative profiles, are very
important in norm-referenced diagnostic testing. A key issue for the
practitioner is the level of detail on which the components of the
profile are differentiated. Component elements of three different
profiles produced by three hypothetical diagnostic test batteries
might be:



General Achievement Tests
    Reading comprehension, Handwriting, Math skills, Social Studies
    concepts, Science facts and concepts

Mathematics Test
    Computation skills, Fractions, Numerical reasoning, Algebraic
    manipulation, Geometric similarity and congruence

Magnetism Test
    Magnetic and non-magnetic materials, Magnetic attraction and
    repulsion, Concept of a magnetic pole, Induced magnetism, Concept
    of a magnetic field, The Earth's magnetic field

Although each relies on the same underlying theory, the
interpretation of results and the prescription of remedial treatment
would be quite different in each case. The first example gives only
global information but might be helpful in indicating whether or not a
student's problems stem from a perception problem, a linguistic
difficulty, or some type of specific learning disability with
physiological roots. By contrast, the second list of profile
components will be chiefly useful in indicating areas of instruction
which have not been mastered by the student, due to some dislocation
of the normal teaching/learning process. For students with very
discrepant patterns it may indicate a need for substantial remedial
study.

The third list of profile components represents an assessment of
performance objective-by-objective. While this might appear the most
useful form of assessment for detailed implementation of an
instructional program, it must be recognized that a great deal of time
is required to obtain reliable estimates of individual profiles at
this level of detail. By aggregating the results of just a few items
across the students in a class, a teacher quite economically can
obtain feedback as to how well the class has mastered specific
objectives, information helpful in planning the next step of the
teaching sequence. However, this approach does not often provide
useful information at the individual level.
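
The class-level aggregation described above can be sketched in a few
lines. This is only an illustration under stated assumptions, not a
procedure taken from any of the tests discussed: the item-to-objective
mapping, the response data, and the 0.7 mastery cut-off are all
hypothetical.

```python
from collections import defaultdict

def class_mastery(responses, item_objective):
    """Proportion of correct answers per objective, pooled over the class.

    responses: {student: {item: True/False}}
    item_objective: {item: objective name}
    """
    correct = defaultdict(int)
    attempted = defaultdict(int)
    for answers in responses.values():
        for item, is_correct in answers.items():
            objective = item_objective[item]
            attempted[objective] += 1
            correct[objective] += int(is_correct)
    return {obj: correct[obj] / attempted[obj] for obj in attempted}

# Hypothetical data: two items per objective, three students.
item_objective = {"q1": "poles", "q2": "poles", "q3": "fields", "q4": "fields"}
responses = {
    "ann": {"q1": True, "q2": True, "q3": False, "q4": False},
    "ben": {"q1": True, "q2": False, "q3": False, "q4": True},
    "cal": {"q1": True, "q2": True, "q3": False, "q4": False},
}
mastery = class_mastery(responses, item_objective)

# Objectives below the (assumed) 0.7 cut-off are candidates for reteaching.
needs_reteaching = sorted(obj for obj, p in mastery.items() if p < 0.7)
```

With only two items per objective, each student's own score on an
objective is too coarse to interpret, but the six pooled responses per
objective give the teacher a usable class-level estimate.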

In each case scores on the component parts of a diagnostic profile
may be interpreted as deviations from the norm. Notice, however, that
"norms" are established by averaging the scores for large numbers of
students, and this does not imply that a flat profile, indicating even
levels of development, is to be expected for any or all students. The
achievement of most children does not proceed in an orderly and
regular fashion, and we should not expect to find unchanging scores as
we move from one area to another. Nevertheless, experience suggests
that substantial unevenness of development [say two grade levels
between subject areas] likely indicates more than a passing
disaffection with one subject or another, and further investigation
would be appropriate. Components of diagnostic profiles within a
particular curriculum area may be expected to be more closely related,
particularly if there are strong logical connections between
sub-areas, as in mathematics. Even so, the typical student will do
better in some areas than in others, and unless the differences are
extreme, a serious learning problem is not necessarily indicated. For
diagnosis of learning objective-by-objective, norm-referenced
interpretations have limited utility. This type of diagnostic battery
is more effective if it can be interpreted in criterion-referenced
terms, especially if the sequence and structure of objectives is
supported by cognitive learning theory.
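
The rule of thumb mentioned above (a gap of about two grade levels
between subject areas) can be written down as a simple screening
check. The sketch below assumes that a profile is available as
grade-equivalent scores per subject; the scores and the function name
are hypothetical, and the threshold is the text's "say two grade
levels," not an established standard.

```python
from itertools import combinations

def uneven_pairs(grade_equivalents, threshold=2.0):
    """Return (gap, subject_a, subject_b) for every pair of subjects whose
    grade-equivalent scores differ by at least `threshold`, largest gap first."""
    subjects = sorted(grade_equivalents.items())
    gaps = [
        (abs(score_a - score_b), a, b)
        for (a, score_a), (b, score_b) in combinations(subjects, 2)
    ]
    return sorted((g for g in gaps if g[0] >= threshold), reverse=True)

# Hypothetical profile for one student.
profile = {"reading": 4.1, "math": 6.3, "science": 5.8}
flags = uneven_pairs(profile)
```

A flagged pair is a prompt for further investigation, as the text
recommends, and not a diagnosis in itself.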

Both Thomas (1983) and Hunter (1979) stress the importance of
accumulating a wide variety of evidence upon which to base an
educational diagnosis. Test scores by themselves can be misleading
unless considered in the context of the conditions under which they
were obtained, the past performance of the student under
consideration, scores of pupils of similar maturity who have been
exposed to similar instruction, information about the student's
linguistic background, etc. For example, while it is entirely proper
that test scores form a part of the data on which any important
classification or assignment of a student to a special educational
program is based, test scores should not be used alone for such
purposes, but should always be supplemented by appropriate contextual
information. Likewise, test scores are one of many sources of
information upon which a teacher draws in making instructional
decisions.

[On the other hand, the use of test scores by an individual for
self-diagnosis may be quite effective. The student can integrate
diagnostic feedback, if appropriately presented, with past experience
in order to help determine what topics or principles he needs to study
more carefully. More research on this type of self-directed learning
is needed.]



The important distinguishing characteristic of norm-referenced
testing, the determination of detailed profiles, rests heavily not
only on the reliability of the particular test and its administration,
but also on the demonstrable validity of the reference norms, and the
implicit assumption that normed profiles, which are composites of many
individual profiles, honestly reflect a developmental reality. Few
children proceed with their education in an orderly and regular
fashion; we should not expect to find unchanging scores as we move
from one area to another. Even within a single domain, the typical
student performs better in some areas than others. Thus, the
norm-referenced approach to diagnostic testing has shortcomings which
are difficult to surmount.

In brief, the norm-referenced approach to diagnostic testing has
two major shortcomings. The first is the question of the relevance of
any particular set of norms to the student being tested, a question
easy to raise but not to resolve in the vast majority of cases. The
second problem concerns the large number of test items which must be
used if reliable and detailed objective diagnostic profiles are to be
developed. Can these problems be avoided by switching to a
criterion-referenced approach?

Criterion-referenced tests explicitly attempt to indicate what
performances should be expected for students with a given score,
without referring to the scores of any other student. In such tests,
the issues of relevance to norms and ranking of students are traded
for an issue about the adequacy of items in relation to the criteria
being used. A good criterion-referenced test will generate a large
amount of information about the overall achievement of a student even
from a small number of test items. Because we do not need to relate
the pattern of performance for an individual to that of a large
normative group in criterion-referenced testing, the testing procedure
itself can be made more flexible; an individual student need not
attempt all items. Adaptive testing, in which the sequence of items
presented to a student depends upon the student's previous
responses, offers a much more efficient way of gathering information
about the student's achievement and may reduce substantially the time
needed to develop a reliable profile (Green, 1983).
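
The efficiency argument for adaptive testing can be illustrated with a
deliberately simplified sketch. It assumes that items are ordered from
easiest to hardest and that the student answers perfectly consistently
(every item up to some level is passed, every harder item failed); real
adaptive tests rely on probabilistic models, and this binary-search
routine is an assumption made for illustration, not the procedure of
Green (1983) or of any published test.

```python
def adaptive_level(n_items, answers_correctly):
    """Locate the hardest item a student can answer without presenting
    every item, by halving the remaining difficulty range each time.

    n_items: number of items, ordered from easiest (0) to hardest.
    answers_correctly: callable taking an item index, returning bool.
    Returns (highest item index passed, or -1; number of items presented).
    """
    low, high = 0, n_items - 1
    highest_passed = -1
    presented = 0
    while low <= high:
        item = (low + high) // 2
        presented += 1
        if answers_correctly(item):
            highest_passed = item
            low = item + 1      # next item presented is harder
        else:
            high = item - 1     # next item presented is easier
    return highest_passed, presented

# Hypothetical student who can answer the 12 easiest of 30 items.
level, presented = adaptive_level(30, lambda item: item < 12)
```

In this idealized case five items locate the student's level on a
30-item scale; with inconsistent responses more items would be needed.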

A good example of a diagnostic test that uses this adaptive
approach is the KeyMath Diagnostic Arithmetic Test (Connolly,
Nachtman, & Pritchett, 1971). This test lies somewhere between the
pure criterion-referenced and norm-referenced approaches since it has
elements of both within its design. The entire instrument consists of
209 test items divided into 14 different components of a diagnostic
profile. The diagnostic profile is developed on a large sheet which
effectively provides a map of arithmetic attainment with the different
content areas listed down the page and the item difficulty levels
moving from "easy" on the left to "difficult" on the right. An
extract from the complete profile sheet is presented in Figure 2. The
circled numbers represent the position of particular items on the
different scales. Items of equal difficulty appear vertically above
or below one another. Scaling according to the Rasch latent-trait
model is used to establish these relative difficulties so they form a
set of relationships expected to be valid for all students, and not
only those belonging to a particular normative group. However, a
"grade equivalent" scale is also provided on the diagnostic sheet so
that normative interpretations of performance are possible.
The strength of this system is that it is adaptive to the needs of the
individual student.

Figure 2

Scaled summary profile of performance from the
KeyMath Diagnostic Arithmetic Test
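
The claim that Rasch-scaled difficulties hold for all students follows
from the form of the model, in which the probability of a correct
response depends only on the difference between the student's ability
and the item's difficulty. The sketch below uses the standard Rasch
formula; the three item difficulties are hypothetical values in logits
and are not taken from the KeyMath test.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model:
    exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical item difficulties, in logits.
item_difficulties = {"A": -1.0, "B": 0.0, "C": 1.5}

def ordering(ability):
    """Items ranked from easiest to hardest for a student of given ability."""
    return sorted(
        item_difficulties,
        key=lambda item: -rasch_probability(ability, item_difficulties[item]),
    )

low_ability_order = ordering(-2.0)
high_ability_order = ordering(2.0)
```

Both students see the items in the same order of difficulty, although
their chances of success on each item differ; this is the sense in
which the relative difficulties do not depend on a normative group.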

3. Analysing errors 

For most of its history, achievement testing has been dominated
by the "number correct" method of scoring, so little attention has
been paid to the nature of the erroneous responses given by students.
Where mistakes have been studied, it is to award partial credit for
an answer to an open-ended question that was nearly correct (for
example, in Great Britain), or for choice of the least incorrect
distractor to a multiple choice item (chiefly in the United States).
Although both teachers and measurement specialists usually agree that
incorrect test responses contain diagnostic information about the
student's performance, there have been few systematic attempts to
exploit this information.



The advent of computer technology in recent years has led to
several attempts at redressing this situation. For example, Brown and
Burton (1978) developed the "BUGGY" system, a computerized game for
training teachers in diagnostic skills, which plays the role of a
student answering questions. The teacher's task is to recognize the
source of the student (computer) error, and to become more sensitive
to the causes of students' learning problems. Under this system,
simple types of error or "bugs" can be easily diagnosed, although
diagnosis becomes much more difficult when the student has several
bugs which may interact.
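
The idea of diagnosing a "bug" from a pattern of answers can be
sketched as follows. This is not the BUGGY program: the two faulty
subtraction procedures and the tiny catalogue below are illustrative
assumptions, and the sketch handles only a single bug at a time with
non-negative whole numbers where the top number is the larger.

```python
def digits(n, width):
    """Digits of n, least significant first, zero-padded to `width`."""
    return [(n // 10 ** i) % 10 for i in range(width)]

def from_digits(ds):
    return sum(d * 10 ** i for i, d in enumerate(ds))

def correct(top, bottom):
    return top - bottom

def smaller_from_larger(top, bottom):
    """Bug: in every column, subtract the smaller digit from the larger."""
    width = len(str(top))
    cols = zip(digits(top, width), digits(bottom, width))
    return from_digits([abs(t - b) for t, b in cols])

def zero_instead_of_borrow(top, bottom):
    """Bug: write 0 whenever the top digit is smaller than the bottom digit."""
    width = len(str(top))
    cols = zip(digits(top, width), digits(bottom, width))
    return from_digits([max(t - b, 0) for t, b in cols])

# Hypothetical catalogue of candidate procedures.
PROCEDURES = {
    "correct": correct,
    "smaller_from_larger": smaller_from_larger,
    "zero_instead_of_borrow": zero_instead_of_borrow,
}

def diagnose(problems, answers):
    """Names of the procedures that reproduce every observed answer."""
    return [
        name for name, procedure in PROCEDURES.items()
        if all(procedure(t, b) == a for (t, b), a in zip(problems, answers))
    ]

problems = [(52, 17), (84, 32), (60, 25)]
answers = [45, 52, 45]            # hypothetical student responses
diagnosis = diagnose(problems, answers)

# A problem that needs no borrowing cannot separate the candidates.
ambiguous = diagnose([(84, 32)], [52])
```

The second call shows why item selection matters: an answer to a
problem without borrowing is consistent with all three procedures, so
only problems that exercise the suspect step are diagnostic.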

In setting up such a system, the initial identification of
misconceptions, or bugs, that produce errors is a complex task. It
requires the analysis of each skill under study and of the "procedural
network" of subskills, and a listing of the correct and incorrect
procedures for applying each of these. In the view of Brown and
Burton, this network analysis needs to be comprehensive, for it must
contain all possible misunderstandings. The need to be comprehensive
restricted Brown and Burton to the rather narrow task of addition and
subtraction. Even within this field, the number of bugs to be
considered is quite large.

This approach has been further elaborated by K. K. Tatsuoka and
colleagues at the University of Illinois (Birenbaum & Tatsuoka, 1980;
Tatsuoka et al., 1980; Tatsuoka & Tatsuoka, 1983). They have also
concentrated on skills of addition and subtraction of signed numbers
using open-ended questions. A major concern of this group was that
students might obtain the right answer to a question by applying
incorrect reasoning, so that the simple "number right" score on a test
might be an inaccurate indication of achievement. By careful
structuring of test questions, they showed that it was possible to
infer when a student was using incorrect rules to obtain the correct
answer to a specific item by an analysis of responses to other items.
Revised scores were produced by rescoring as "incorrect" any correct
response deduced to have been reached by wrong reasoning, and the
revised scores were shown to be superior on each of a number of
measurement criteria. This research also demonstrated the inadequacy
of factor analysis as a technique for investigating the structure of
achievement tests. To a significant extent, the factor structure
appears to be determined by the pattern of misconceptions held by the
students as well as by the content of the items themselves. Tatsuoka
et al. (1980) introduced the "Individual Consistency Index" (ICI)
which, when applied to the pattern of responses for an individual, can
indicate the extent to which the student is using "erroneous rules" to
solve the problems. However, as pointed out by Tatsuoka and Tatsuoka
(1983), since most tests do not have the special structure required
for the calculation of the ICI, the method has a limited application.
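
The rescoring idea can be illustrated with a small sketch. It is not
the ICI and does not reproduce the Illinois group's procedure; the
erroneous rule below is invented for illustration, and the decision to
rescore only when the whole response pattern matches that rule is a
simplifying assumption.

```python
def correct_rule(a, b):
    return a + b

def first_sign_rule(a, b):
    """Hypothetical erroneous rule for signed addition: take the difference
    of the absolute values and give it the sign of the first number."""
    magnitude = abs(abs(a) - abs(b))
    return magnitude if a >= 0 else -magnitude

def rescore(items, responses, erroneous_rule):
    """Return (raw, revised) item scores.

    If every response is reproduced by the erroneous rule, and that rule
    gives a wrong answer on at least one item, all responses are rescored
    as incorrect, including those that happen to be right."""
    pairs = list(zip(items, responses))
    raw = [int(r == correct_rule(a, b)) for (a, b), r in pairs]
    follows_rule = all(r == erroneous_rule(a, b) for (a, b), r in pairs)
    rule_fails_somewhere = any(
        erroneous_rule(a, b) != correct_rule(a, b) for a, b in items
    )
    if follows_rule and rule_fails_somewhere:
        return raw, [0] * len(raw)
    return raw, raw

items = [(5, -3), (-5, 3), (-3, 5), (-2, -6)]
responses = [2, -2, -2, -4]        # hypothetical student answers
raw, revised = rescore(items, responses, first_sign_rule)
```

Here the first two answers are right only because the erroneous rule
happens to agree with the correct one on those items; the last two
items, on which the rules disagree, are what make the inference
possible. This mirrors the text's point that the test needs a special
structure for such analysis to work.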

The detailed analysis that was required to produce a workable
system in signed number arithmetic suggests that there will be no
general all-purpose computer program that will be able to magically
diagnose a pupil's erroneous answers to test items regardless of the
subject matter. A full analysis of the logical steps in problem
solution outside the area of mathematics is likely to be beyond the
capabilities of teachers, curriculum specialists, and professional
test constructors.

However, Nesbit (1966) did demonstrate an approach to diagnostic
testing that teachers could handle. It requires that a teacher
catalogue the important errors and/or misconceptions common among
students in a particular curriculum subdomain, and then write
multiple choice items in which the incorrect alternatives (or
distractors) reflect these common misconceptions. Simple analysis of
the responses to a set of such questions can indicate whether a
student is operating under a particular misconception or not. Though
far less comprehensive than the Tatsuoka and Tatsuoka system, and not
based on a detailed logical analysis, this approach appears to be much
more practical. Even so, experience suggests that the cataloging of
error types in a way that multiple choice questions can differentiate
between them still requires considerable preparation, and groups of
teachers working together may find this more feasible than individual
teachers. The use of multiple choice rather than open-ended questions
has the disadvantage of denying a student with an unusual
misconception or erroneous rule the opportunity to demonstrate
it, but does focus on the main or most frequently encountered errors.
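
The distractor-keyed analysis described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Nesbit's procedure: the item bank, the misconception labels, and the cutoff of two occurrences are all hypothetical.

```python
from collections import Counter

# Hypothetical item bank: every option of every item is keyed either to
# "correct" or to the misconception that would lead a student to choose it.
ITEM_KEYS = {
    "q1": {"A": "correct", "B": "adds_absolute_values", "C": "ignores_sign"},
    "q2": {"A": "ignores_sign", "B": "correct", "C": "adds_absolute_values"},
    "q3": {"A": "adds_absolute_values", "B": "ignores_sign", "C": "correct"},
}


def tally_misconceptions(responses, item_keys=ITEM_KEYS):
    """Count how often each keyed misconception was chosen."""
    counts = Counter()
    for item, choice in responses.items():
        label = item_keys[item][choice]
        if label != "correct":
            counts[label] += 1
    return counts


def flag_misconceptions(responses, min_count=2, item_keys=ITEM_KEYS):
    """Return misconceptions chosen at least `min_count` times (assumed cutoff)."""
    counts = tally_misconceptions(responses, item_keys)
    return sorted(label for label, n in counts.items() if n >= min_count)


# One student's choices on the three items.
student = {"q1": "C", "q2": "A", "q3": "C"}
print(flag_misconceptions(student))
```

The cutoff guards against treating a single slip as a misconception; a teacher group would presumably set it according to how many items are keyed to each error type.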

III. DIAGNOSIS IN MEDICINE 


While diagnosis in many areas of education has been making rather
slow progress, the last dozen years have seen enormous growth in
theory and practice in the field of medicine. The limitations which
prevent wholesale borrowing from medical applications are both obvious
(does a "disease" model apply to educational difficulties?) and subtle
(does educational measurement achieve equal probabilistic
accuracies?). However, recent developments within medicine, and
especially within the technical field of artificial intelligence in
medical diagnosis, make it important for educators to examine the
successes and failures even if details of the diagnostic question are
not completely parallel between professions.

The practice of diagnostic medicine has been under refinement for
as long as medical schools have existed in America. Indeed, for an
extended period of time, except for a few medicinals and a limited
surgical repertoire, the practice of medicine was virtually restricted
to the formulation of diagnoses. In this one area, physicians were
able to develop extensive and often labyrinthian categories within
categories, developing and occasionally discarding the pieces of
diagnostic nosology, building a foundation of modern diagnostic
practice. Today's general practitioner faces thousands of possible,
fully legitimated, diagnostic situations; for the common kinds of
illness, all of the following are likely to be true:

a) the category is a recognized and documented disease entity; 

b) the status indicators (signs, symptoms, and relevant history)
are either specifically understood and delineated or, at worst,
have already undergone detailed study;

c) the probabilities associating the presenting symptoms with a
variety of disease hypotheses are known fairly closely;

d) the probabilities associating the various diseases with
best-fitted therapeutic strategies are known in a general way;

e) the probabilities associating each disease and its recommended
treatment with patient outcome are at least roughly estimated;

f) the various combinations of costs and benefits, including
situations in which two or more diseases are compounded with each
other in the same case, are calculable.



Thus many presenting patient problems can often be translated by
cookbook into unambiguous terms: the medical problem is "x" within a
specific confidence interval, its course is fully anticipated with
(and without) treatment "y," and such treatment has a closely
predictable likelihood of benefit at a known cost.
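
When the probabilities linking symptoms to disease hypotheses are known, the step from presenting findings to "the problem is x, with stated confidence" is an application of Bayes' rule. The sketch below is illustrative only: the function name, the two hypotheses, and every number are assumptions chosen for the example, and findings are assumed conditionally independent given the hypothesis.

```python
def posterior(priors, likelihoods, findings):
    """Bayes' rule over mutually exclusive hypotheses.

    priors:      {hypothesis: P(h)}
    likelihoods: {hypothesis: {finding: P(finding present | h)}}
    findings:    {finding: True/False} as observed for this case
    Findings are assumed conditionally independent given the hypothesis.
    """
    unnormalized = {}
    for h, prior in priors.items():
        p = prior
        for finding, present in findings.items():
            p_f = likelihoods[h][finding]
            p *= p_f if present else (1.0 - p_f)
        unnormalized[h] = p
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("findings are impossible under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


# Illustrative numbers only; they are not drawn from any clinical source.
priors = {"x": 0.10, "not_x": 0.90}
likelihoods = {
    "x": {"fever": 0.80, "rash": 0.60},
    "not_x": {"fever": 0.10, "rash": 0.05},
}
post = posterior(priors, likelihoods, {"fever": True, "rash": True})
print(round(post["x"], 3))
```

The same calculation is rarely available in education, where the prior and conditional probabilities on the right-hand side are seldom known.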

From such a highly defined diagnostic structure has emerged a
variety of sophisticated models used to explain the manner in which
the professional enters and exits the diagnostic question, how the
various paths are profitably explored, and how the disease entity, in
time, is understood both statically and dynamically (Gheorghe, Bali,
Hill & Carson, 1976; Miller, Westphal & Reigart, 1981; Patil,
Szolovits & Schwartz, 1981, 1982; Szolovits, 1979; Szolovits & Pauker,
1978).

Gorry (1970) defined diagnosis in the medical context as 

...the problem solving activity directed toward the 
classification of a patient for the purpose of relating 
experience with past patients to him and of assessing the 
therapeutic and prognostic implications of his condition 
(p. 293). 

The diagnostic model which ensues is a problem-solving approach, in
which the professional's knowledge, maintained as a generalization
from his professional education, is brought into focus in aligning the
particular signs and symptoms to the closest similar known disease.
The process is in three parts: the obtaining of information, the
evaluation of decision alternatives, and the making of a suitable
diagnosis or the obtaining of additional information if the diagnosis
is not yet indicated. It is a model as much of cognitive functioning
as of diagnosis itself (and, perhaps surprisingly, embodies certain
strong resemblances to the model of educational diagnosis developed by
Thomas). The idea of a decision tree, and a number of mathematical
properties associated with such processes, have been explicated in
detail (see Jacquez, 1972; Lauder, 1981); the decision tree enters the
physician's strategy at the point of evaluating decision
alternatives. The model is carried further by such writers as
Elstein, Schulman and Sprafka (1978), who point out that many
physicians do not enter the problem-solving approach without already
having formed a series of working hypotheses:

Early generation of tentative diagnostic hypotheses is ...
used by clinicians to bound the regions of the potential
problem space most likely to yield the solution. The
subsequent workup is planned to permit testing or
refinement... The method used to narrow diagnostic hypotheses
and reach closure about problems or treatment alternatives is
a form of means-end analysis in which specific clinical
findings or clusters of findings serve as operators or "movers"
to reduce the distance between the point where the problem
solver is and where he would like to go (p. 278).
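
The cycle of obtaining information, evaluating alternatives, and either closing on a diagnosis or seeking more information can be sketched as a loop that starts from early working hypotheses and stops as soon as one of them is sufficiently well supported. This is a simplified illustration, not a model taken from the works cited: the function name, the fixed order of findings, the 0.95 closure threshold, and the numbers are all assumptions.

```python
def workup(priors, likelihoods, observe, order, threshold=0.95):
    """Sequential workup: update beliefs finding by finding and stop early.

    observe(finding) -> True/False reports whether the finding is present.
    order: findings to request, in sequence (a fixed plan, assumed here).
    Returns (leading hypothesis or None, final beliefs, findings used).
    """
    beliefs = dict(priors)
    used = []
    for finding in order:
        leader = max(beliefs, key=beliefs.get)
        if beliefs[leader] >= threshold:
            return leader, beliefs, used  # closure reached, stop testing
        present = observe(finding)
        used.append(finding)
        for h in beliefs:
            p = likelihoods[h][finding]
            beliefs[h] *= p if present else 1.0 - p
        total = sum(beliefs.values())
        if total == 0:
            raise ValueError("evidence is impossible under every hypothesis")
        beliefs = {h: b / total for h, b in beliefs.items()}
    leader = max(beliefs, key=beliefs.get)
    return (leader if beliefs[leader] >= threshold else None), beliefs, used


# Two working hypotheses and three candidate findings (illustrative values).
priors = {"a": 0.5, "b": 0.5}
likelihoods = {
    "a": {"f1": 0.9, "f2": 0.9, "f3": 0.9},
    "b": {"f1": 0.1, "f2": 0.1, "f3": 0.1},
}
leader, beliefs, used = workup(priors, likelihoods, lambda f: True,
                               ["f1", "f2", "f3"])
print(leader, used)
```

In this example the third finding is never requested, because the first two already carry the leading hypothesis past the threshold.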

In a massive study of diagnosis and computerization in medicine,
Williams (1981) presented a series of viewpoints about the diagnostic
process oriented around the orderly and logical clustering of
phenomena by the observer. A major question posed by Williams is
"when to study" [...]
categorical, probabilistic, artificial intelligence, and pattern
recognition models, each of which carries an extended and precise
mathematical definition.

...Categorical approaches are particularly appropriate when
the individual... "doesn't know where to start", when he seeks
focus and context in a complex and ill-bounded area, and when
decision choices may be optimized and then standardized
according to categorical criteria. Probabilistic approaches
are most useful for limited and clearly bounded problems with
mathematically manageable numbers of variables. When "good"
and relevant data are available, classic probabilistic
approaches are applicable and may be used to support and
refine expert decisions. When such data are not available,
expert judgment may be codified using pseudoprobabilistic
techniques and plausible reasoning, procedures that are also
important in propagating even well supported uncertainty
estimates, derived from classic probability, between models
at different levels (vol. 1, p. 156).

The diagnostic situation in medicine involves, in its simplest
form, the nature of the illness, the skills of the professional in
discovering the exact specifications of that illness, and the tools
available to aid that discovery process. In the first two areas the
last decade has seen extensive research in statistical modeling of
diagnostic classification, diagnostic probabilities, optimization
strategies, and decision paths. In the last area, there has been an
explosion of effort in relation to computerization of the diagnostic
process.

A number of writers (Barr & Feigenbaum, 1982; Blois, 1980;
Rogers, Ryack and Moeller, 1979; Weiss, Kulikowski, Amarel, & Safir,
1978) have provided overviews of computer-aided medical diagnosis.
Over the past two decades, several extremely sophisticated interactive
inquiry programs have been executed; the end user is prompted for
specific information and shown, at appropriate places, the variety of
possible diagnoses under consideration. MYCIN, for instance, utilizes
a strategy of narrowing its options based on its conversation with the
medical professional at a computer terminal until a point at which it
can state a diagnosis, its confidence in that diagnosis, some
alternative diagnoses if applicable, and a recommendation for course
of treatment in both expected and adverse circumstances. The typical
configuration of a computer-based diagnostic system involves a
disease-symptom database, a combination of heuristic and statistical
algorithms for developing decisions, and, through the input of the
medical professional, interactive contact with the target case during
the diagnostic process and again upon confirmation of the diagnosis.
The last step provides a feedback mechanism with which the program can
validate its database. These approaches are not without controversy
(see discussion section below) but the potential for computerization
of the diagnostic process in medicine has been thoroughly
demonstrated.

Specific illustrations can be found even in areas where the
experienced clinician faces a challenge. One diagnostic problem in
newborns occurs because a wide variety of congenital malformations is
possible yet any single physician is likely to encounter them rarely.
Computer programs now exist which allow a
diagnostic determination to be made by a computer which accesses 224
different postnatal syndromes. Bone marrow evaluation, a pathological
specialty which relies on extensive amounts of complex data, is
currently being conducted on an experimental basis using a
microcomputer (Wheeler, 1983). The program collects data from several
sources, provides textual and graphic information to the medical
professional, and concludes with a Disease Attribute Matrix Score,
which combines symptoms and statistical weights to yield a tentative
diagnosis or ruleout. This can be accepted or returned for revision,
in which instance the user enters a series of increasingly selective
queries in an attempt to further refine the working hypotheses.

Probabilistic modeling of medical decision making is another
topic in current development for microcomputers (Galen, 1983; Savage,
1972), apparently with success. Over the remainder of this decade,
the profession anticipates increasing reliance on computer technology
not only in the making of specific diagnoses to fit specific
individual cases, but in enabling the medical professional to improve
the entire diagnostic process.

IV. A COMPREHENSIVE MODEL OF DIAGNOSIS IN EDUCATION

A review of the successes of diagnostic theory and practice in
medicine from the viewpoint of education illuminates the following
general problems. Unlike medicine, which draws from extensive
experience with most disease entities, educational diagnosis seldom
has the same unambiguous reference base. While medical diagnosis
successfully employs probabilistic methods, educational diagnosis only
occasionally has sufficient amounts of information to support
probabilistic techniques. Medical diagnosis builds on strong
inference, but educational diagnosis has developed only portions of
the necessary inference techniques which would allow the same degree
of success.

As Hennesey (1981) illustrates, educational diagnostic
specialists have accumulated "a vast amount of rich data and insight
to support their practices" (p. 56). Yet the present status of models
of diagnosis in education is significantly behind that of diagnostic
models in medicine in at least three respects. What appears to be
lacking in education is the following:

a) design of strategies: an explanation of what the
diagnostic process specifically attends to (and what it
ignores) as well as what it requires the professional to do
and the range of options available for doing such;

b) accumulation of evidence: a definition of what constitutes
sufficient information for finalizing a diagnosis and a
recognition of the strengths and weaknesses of differing
information-gathering strategies; and

c) computerization: use of computers to aid the teacher in
collecting and evaluating data towards concluding in a
diagnosis.

The first two requirements deal with the scope of the diagnostic
inquiry. Thomas' first "critical question" asks: what are the specific
objectives which the individual student is expected to have achieved?
The appropriate signs and symptoms are those which point to some
failure in expected achievement with those specific objectives; the
working hypotheses concern the variety of plausible explanations for
such a deficit. At that point, the second requirement indicates that
the next task is to discover data which will narrow the list of
working hypotheses appropriately.

Within this context, a generalized model, adapted from Burke (in
Williams, 1981) with permission, shows how the task of diagnosis fits
between the problem and the management solution. Figure 3 traces the
steps of this generalized model of diagnostic process. Initial signs
and symptoms are organized, following a theoretical base if possible,
such that an initial profile of the student's weaknesses can be drawn
together. This profile needs to address the target deficit with
sufficient specificity (the substance of the area of achievement must
be represented adequately) and with sufficient selectivity (the range
of performance within the area of achievement must bracket the child's
present capabilities) (Weiss, 1983). Ample consideration must also be
paid to instructional history (Tatsuoka & Birenbaum, 1979). Working
hypotheses are developed, the more formally associated with theory the
better, based on an initial understanding of the pattern of responses,
and from these hypotheses the most germane diagnostic test strategies
(elaborated in the following section) are brought into play.
"Pattern," in the context of individualized diagnostic assessment, is
used to reflect unusual responses to a set of test items, or a set of
specific erroneous responses across similar items in a test. (For some
testing strategies which explore the latter, see the accompanying
paper by Choppin, 1983.)

Following the development of initial hypotheses, the ideal
construction of a diagnostic process stems from the professional's
careful reading of the evidence to date and sequencing of steps to
gather additional evidence, until one of three actions can occur:

a) the initial hypotheses concerning the specific educational
problem are supported by the tests;

b) the initial hypotheses are supported but with an unacceptable
level of ambiguity;

c) the initial hypotheses are excluded.

If the initial hypotheses are supported by the tests, no further
testing is required and the examiner moves, with some certainty, to
the task of implementing an appropriate remediation, tailoring of the
curriculum, re-education or referral. The examiner arrives at the
diagnostic end point with confidence and can optimize the selection of
a management strategy for the case. However, the initial hypotheses
may not be completely supported by the tests; further testing which
might lend clarity may be too costly in time or money. With some
degree of uncertainty the examiner moves to the management of the case
(and such management may consist of a referral for more specialized
testing or simply waiting for some favorable turn of events). While
the examiner traces the same path on Figure 3 as for the successful
diagnosis, the outcome is expressed in less certain terms.

FIGURE 3. Complex diagnostic and management processes. (From Burke, M.
D., Mount Sinai Hospital, Minneapolis, Minn. With permission.)

Alternatively, the examiner can use further testing to investigate
whether the initial hypotheses can be overturned and the working
diagnostic interpretation excluded ("exclusion certain"), or whether a
different approach to the problem can generate confirmation of the
initial hypotheses from a separate perspective ("confirmation certain").
The initial hypotheses are excluded by one of three approaches. The
diagnostic testing may prove them untenable. Some "early" exclusion
criterion, such as strong evidence from prior testing or another
professional, is provided which obviates the need to explore the
initial diagnostic hypotheses further. Or those hypotheses may be
excluded by an "exception trigger," a critical finding that manifests
itself in psychological or educational difficulties but stems from a
completely different domain altogether, for example, organic illness.
These exclusions all lead the examiner away from the diagnostic
endpoint in the lower right corner of Figure 3, and each implies that
the initial hypotheses were unsatisfactory. Further work is required,
most likely involving a second look at the initial profiles of
educational problems to generate a new set of working hypotheses.
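
The three outcomes, together with the early-exclusion and exception-trigger routes, amount to a small decision rule. The sketch below is one hypothetical way to encode it; the numeric "confidence" scale and the two cutoffs are assumptions introduced for illustration and do not come from the model in Figure 3.

```python
from enum import Enum


class Outcome(Enum):
    SUPPORTED = "supported"
    AMBIGUOUS = "supported with unacceptable ambiguity"
    EXCLUDED = "excluded"


def classify(confidence, exception_trigger=False, early_exclusion=False,
             accept=0.85, reject=0.30):
    """Map the evidence for the working hypotheses to one of three outcomes.

    confidence: degree of support in [0, 1] for the initial hypotheses.
    accept / reject: hypothetical cutoffs, not values from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Any of the three exclusion routes overrides the test evidence.
    if exception_trigger or early_exclusion or confidence <= reject:
        return Outcome.EXCLUDED
    if confidence >= accept:
        return Outcome.SUPPORTED
    return Outcome.AMBIGUOUS


print(classify(0.90).value)
print(classify(0.55).value)
print(classify(0.90, exception_trigger=True).value)
```

An excluded outcome sends the examiner back to the initial profile to form new working hypotheses, while an ambiguous one leaves the choice between further testing and management under uncertainty.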

At this stage, the model has served to alert the teacher to the
possibility that a) initial hypotheses fit within a context of both
available evidence and theory, and b) these working hypotheses help
determine both what further evidence to gather and what exceptions to
consider at the same time. More detail about the operations within
diagnostic testing is offered below; at this juncture, however, it is
important to note that three goals in tracing diagnosis in Figure 3
from top left to lower right are to do so quickly, efficiently, and
with a high level of confidence. These three are not entirely
exclusive but practical considerations militate heavily against the
professional proceeding well on all three accounts unless the data are
also of high quality.

Acquiring data to support (or remove) a working hypothesis can
proceed in several ways. Thomas (1983) supplies six possible sources
of data: standardized tests, teacher-made tests,
regular student assignments, unrecorded observations, rating scales
and interviews. The section which follows explores options for formal
test strategies in diagnosis. The present state of educational
testing in diagnosis is just emerging from an exclusive reliance on
conventional ad seriatim testing and moving into a rich variety of other
strategies, some of which are set forth in Figure 4. The figure
portrays schematically the movement made by the student when faced
with a single test item and the ensuing possible decision points
available to the examiner in each of four test strategies. The four
general schemes are:

a) ad seriatim testing: test items are administered from
first to last. No change in sequence is contemplated during
the test, and generally the evaluation of the diagnostic
hypothesis is not begun until completion of the test. Most
conventional educational testing and a majority of existing
tests designed to be inherently diagnostic in application
proceed in this manner.

b) answer-until-correct testing: test items are
administered from first to last, but a wrong answer returns
the student to another opportunity to respond to the same
item again, with available answers reduced by one. The
evaluation of diagnostic hypotheses occurs as the student
repeats the same answer strategy and obtains similar
sequences of wrong answers from item to item.

c) compress-decompress (or "stradaptive" [Thompson & Weiss,
1980]) testing: test items are administered according to a
selection rule or structural lattice which allows a correct
response to one item to lead to an item of greater
complexity, while a wrong response to the first item leads
next to an item of greater simplicity. The evaluation of
diagnostic hypotheses occurs as the student repeatedly
selects similar erroneous responses across items, and/or
selects correctly at one level of test complexity but not at
the next higher level, and/or selects dissimilar responses
across items of the same complexity.

d) developmental testing: test items are administered
serially, often across an extended period of time. The
student's response to each item is codified in multiple ways,
which may include appraisals of the method or methods the
student utilized to reach an answer, the type of answer
given, how the student chooses to represent that answer in
some formal way such as with text or symbols, and/or how the
student reconstructs the original problem from the
representation s/he made earlier. Evaluation of diagnostic
hypotheses is possible upon complete codification of scores
to each item.
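
The selection rule in scheme (c) can be illustrated with a short simulation. This is a bare sketch of the up-after-correct, down-after-wrong rule only: the function names, the six complexity levels, the fixed test length, and the stand-in examinee are all hypothetical, and a real stradaptive test would draw actual items from each stratum.

```python
def next_level(level, correct, n_levels):
    """Move up one complexity level after a correct response, down after an error."""
    step = 1 if correct else -1
    return min(max(level + step, 0), n_levels - 1)


def administer(answer_item, n_levels, start_level, n_items):
    """Run a fixed-length branching test.

    answer_item(level) -> True/False stands in for presenting an item at
    the given complexity level and scoring the response.
    Returns the list of (level, correct) pairs in order of administration.
    """
    level = start_level
    record = []
    for _ in range(n_items):
        correct = answer_item(level)
        record.append((level, correct))
        level = next_level(level, correct, n_levels)
    return record


# A deterministic stand-in examinee who succeeds only on levels 0 through 2.
record = administer(lambda level: level <= 2, n_levels=6, start_level=0,
                    n_items=8)
print(record)
```

After the initial climb, the record alternates between a level the examinee passes and the next level up, which is the pattern of selecting correctly at one level of complexity but not at the next higher level.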

Each of the four "maps" for traveling through a test has been
used for tests which are not inherently diagnostic in nature. Nor do
the four provide either an exhaustive review of all possible test
design strategies or necessarily a set of practically exclusive
heuristics: it is entirely possible that advantages of one or another
of the designs can be folded into a combined form of testing, and/or
that a single test could begin with one scheme but branch to another
at some decision point. However, the primary reason for
distinguishing the four "maps" at present is to demonstrate the differing
sources of diagnostic information that occur within each:

Ad seriatim testing: diagnostically useful information is
available at the end of a complete sequence of items, but only under
special circumstances is information available before testing is
terminated.

Answer-until-correct testing: diagnostically useful information
is available whenever students select incorrect answers, and such
information can be used to terminate testing before the last item.
However, the information provides no immediate guidance as to sources
of error.

Compress-decompress testing: diagnostically useful information
is available after each student response, because the correctness of
the response is used to determine the next item to be presented. The
nature of the error made if the response is incorrect can be
evaluated. The student may work towards some "balance point" within a
domain, in which more difficult items cannot be answered without error
while less difficult items pose no problem.

Neo-Piagetian (developmental) testing: diagnostically useful information is
available while the student is making a response, after the student
has completed the response, as the student works to draw or write down
the problem as a representation of his/her thinking, and as the
student views that drawing or written narrative and talks about
his/her memory of the problem and the response.
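
The answer-until-correct procedure, and the idea of looking for wrong-answer sequences that recur from item to item, can be sketched as follows. The function names and the sample data are hypothetical, and comparing sequences across items is meaningful only if option labels are keyed consistently to error types, which is assumed here.

```python
def answer_until_correct(options, key, choose):
    """Administer one item under answer-until-correct.

    options: list of option labels; key: the correct label.
    choose(remaining) -> label picks from the options still available.
    Returns the sequence of choices, ending with the key.
    """
    remaining = list(options)
    attempts = []
    while remaining:
        choice = choose(remaining)
        if choice not in remaining:
            raise ValueError("choice must be one of the remaining options")
        attempts.append(choice)
        if choice == key:
            return attempts
        remaining.remove(choice)  # available answers reduced by one
    raise ValueError("key was not among the options")


def repeated_error_sequences(attempt_log):
    """Group items by the sequence of wrong answers that preceded the key."""
    groups = {}
    for item, attempts in attempt_log.items():
        wrong = tuple(attempts[:-1])
        if wrong:
            groups.setdefault(wrong, []).append(item)
    # Keep only sequences that recur on more than one item.
    return {seq: items for seq, items in groups.items() if len(items) > 1}


# A stand-in examinee who always takes the first option still on offer.
attempts = answer_until_correct(["A", "B", "C", "D"], "C", lambda r: r[0])
log = {"q1": ["A", "B", "C"], "q2": ["A", "B", "D"], "q3": ["B"]}
print(attempts, repeated_error_sequences(log))
```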

All four approaches have analogues in medical diagnosis. The
first, ad seriatim testing, reflects the protocol followed in obtaining a
patient's responses to a standard family medical history: the patient
goes straight through until the final item without interference from
the medical professional. The second, answer-until-correct testing,
mirrors the protocol used when portions of that same history are
readministered orally for purposes of confirmation or further detail.
The fourth, developmental testing, follows to some degree the
multi-modality testing used in such complex arenas as
neuropathological diagnosis, in which the professional uses a wide
range of dissimilar tests over a period of time in order to isolate a
specific impairment.

The third approach, compress-decompress testing, reflects the
more complex protocols frequently required to diagnose those problems
for which multiple alternative explanations are not easy to rule out.
As the professional begins to believe s/he has acquired information
which fits, that information is incorporated (or "compressed") into a
more encompassing understanding of the problem, until, at some point
in time, sufficient confirmatory data are in hand to allow, without
further delay, a diagnosis and a plan of medical care. However, as
the professional gathers information which is disconfirmatory, the
diagnostic process now moves to "decompress" the available
information, and if necessary gather even more data, until a plausible
alternative hypothesis emerges with some degree of certainty.

The general model of diagnostic process allows a perspective on
possible computerization. First, using a computer to accomplish this
process requires that enough is already known about particular sets of
errors or problems to facilitate the formation of initial hypotheses.
If true, then each of the four heuristic designs of Figure 4 can be
brought within the strictures of the real-time interactive computer.
Second, with the computer used for both administration and statistical
analysis, the teacher can engage interactively, during test
administration or after, to provide additional information for
a categorical or probabilistic diagnostic assessment predicated on
solid mathematical principles (Bock & Mislevy, 1982; Tatsuoka & Linn,
1983; Weiss, 1982). Further comments about computerization follow
later in this paper.

DISCUSSION

1. Diagnostic interpretations of illustrative data

The following is a brief summary of findings from four studies of
test performance and diagnosis: an ad seriatim test of language arts
skills (presented in detail as a separate report), an
answer-until-correct test of arithmetic skills, a compress-decompress prototype
test of understanding of science, and a developmental test of
elementary number concepts. The first, second, and fourth tests
adhere closely to the first, second, and fourth heuristics of
diagnostic testing presented earlier (ad seriatim,
answer-until-correct, and neo-Piagetian); the third served to
illustrate certain aspects of the third heuristic
(compress-decompress) although it was delivered to students serially.
The four tests were each designed to reflect very specific subject
domains and were administered in different ways to different
examinees:

Language arts: a 92 item test of pronoun understanding, in which
the development of the items followed a rigorous structural
interpretation of pronoun usage and complexity, and of the
sentence context within which target pronouns were
embedded. The test items were developed to reflect the
application of six rules of grammar in usage of first person
plural, third person singular, and third person plural
constructions. For each rule, six items required the
examinee to recognize and select the correct form and rule
without making inferences, and six required the student to
infer the correct form and concept from the item stem. This
test was administered as a paper-and-pencil test to 49
Fluent English Proficient and 79 Limited English Proficient
sixth graders in Los Angeles County.

Arithmetic skills: a 10 item test of arithmetic skills involving
addition, subtraction, and multiplication approximately
geared to the sixth grade level. This test was administered
on a one-to-one basis by microcomputer following the
answer-until-correct strategy (items were presented again if
the examinee's response was wrong, until such time as the
right answer was selected from the remaining options or
itself was the only remaining option). Examinees for this
test were 68 fourth through eighth grade students attending
summer courses in computing at UCLA.

Understanding of science: a 20-item test of selected concepts in
science, constructed with attention to two key factors:
rational construction of distractors within each item and
between related items, and hierarchical ordering of items by
complexity. The test involved three kinds of distractors:
logical fallacy, intuition distraction, and content distraction,
presented in items of low, medium and high difficulty
from selected topics in science. This test was administered
as a paper-and-pencil test to 190 students representing a
very large range of exposure to science concepts:
high-talent private junior high students, a mixed range of
ability levels in public high school classes, and entering
college freshmen studying introductory biology, all in Los
Angeles County.

Elementary number concepts: a multi-part developmental test of
selected concepts in counting and constructing one- and
two-digit numbers using wooden blocks. The test examined
concepts of number including counting, adding, subtracting,
constructing with modular blocks and constructing
combinations. This test, building extensively on
neo-Piagetian theory, was administered adaptively on a
one-to-one basis by trained examiners to 99 kindergarten
through second grade pupils in Santa Barbara County.

The language arts test data were extensively analyzed by methods
which address group and subgroup distinctions and differences between
facets of the item design. Of interest to the present report are
those findings which address individual performance. What emerges is
a profile of each examinee's performance presented as proportion of
correct response to the item facets, annotated by a statistic which
addresses correspondence between profiles and various generalizability
coefficients at each facet. Selected cases show substantial variation
in relation to the item facets, but classical test strategies show
that the test itself is reliable and that certain expected patterns of
performance (lower success with context-embedded pronouns than with
the same pronoun in an item without the embedding, for example)
generally hold true. Diagnoses of individual problems with particular
forms of pronoun usage can be easily drawn from examination of
patterns of performance, on a case-by-case basis. In this sense, the
predominant meaning of "pattern" is a profile of subscale scores.
Some difficulties were common to all students, and thus not inherently
diagnostic. Other difficulties applied to selected students, and for
these the individual profiles should yield diagnostically useful
information. Profiles for students with fluency in English were
paralleled by the profiles for Limited English Proficient students.
Further discussion is found in the accompanying report (Webb et al.,
1983).

The arithmetic skills answer-until-correct test was analyzed by a
variety of methods which primarily address issues of reliability and
selection (for an extended discussion see the accompanying reports by
Wilcox, 1983). The procedures evaluate the probability of correctly
determining that a given examinee knows a given item. Using the
answer-until-correct model, probabilities were estimated for each
item: the first six had probabilities exceeding .85; the remainder
were at least .71. The probability of at least seven correct
decisions (i.e., whether it was correctly determined that an
examinee knows an item) could also be estimated for this test. Using
recent psychometric developments, it was determined that an estimated
lower bound of this probability value was .70, while the estimated
lower bound of at least six out of ten correct decisions was .83. Thus
the test appears to be fairly accurate, although additional scoring
rules which are useful in improving accuracy had minimal effect with
this dataset.
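
The relation between per-item decision probabilities and the
probability of "at least m correct decisions" can be illustrated with
a minimal sketch. It assumes that decisions on different items are
independent, which is a simplification and not the lower-bound
procedure used in the study; the function name and the ten
illustrative probabilities are hypothetical.

```python
def prob_at_least(correct_probs, m):
    """P(at least m correct decisions), ASSUMING independent items."""
    # dist[c] holds the probability of exactly c correct decisions so far.
    dist = [1.0]
    for p in correct_probs:
        nxt = [0.0] * (len(dist) + 1)
        for c, mass in enumerate(dist):
            nxt[c] += mass * (1.0 - p)   # decision on this item is wrong
            nxt[c + 1] += mass * p       # decision on this item is right
        dist = nxt
    return sum(dist[m:])


# Hypothetical values chosen only to resemble the ranges reported above:
# six items near .85 and four items near .71.
item_probs = [0.85] * 6 + [0.71] * 4
p_six = prob_at_least(item_probs, 6)
p_seven = prob_at_least(item_probs, 7)
```

Because the independence assumption is not the basis of the reported
bounds, the values produced here need not match the .70 and .83 given
in the text.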

Essentially, this analysis is premised on a latent-trait model of
examinee behavior in which the harder items generate more inaccurate
measures of whether or not the student knows the correct answer, and
at the same time call into play more guessing behaviors. The
probability of making a given number of correct decisions given the
total number of items is analogous to a test of reliability, in that
both generate a single number which characterizes the adequacy of the
set of test items. These procedures, however, do not specifically
speak to the problem of individual diagnosis. Instead, that issue can
be taken up by other measures of individual performance using the
probabilistic information of correct determination of the individual's
latent state as a base. However, it should be noted that methods
which use the first response only as an indication of right or wrong
will not comport with the answer-until-correct analyses, because the
latter are able to take the full nature of the response behaviors to a
given item into account. The only case in which traditional measures
of 1/0 scoring based on first response only will agree with
answer-until-correct analyses is the impossible case in which
examinees never find the correct answer if they miss an item on the
first try.

It is important to note that answer-until-correct testing
utilizes a highly specific definition of "pattern" in analyzing test
performance: "pattern" is taken to mean repeated attempts to secure a
correct answer, with both the number of such attempts within a given
item and the number of items requiring repeated attempts having a
direct impact on the associated statistics. The use of
answer-until-correct testing to diagnose individual difficulties in a
content domain could rely upon these two elements within the
individual pattern of performance.

Research efforts by a diverse group of educational and
psychological workers have explored the nature of logical thinking and
hierarchical structuring of knowledge (cf. Cotton, Gallagher and
Marshal, 1977; Dreyfus and Jungwirth, 1980; Rodgin, 1955), but in
general there remains a great deal of disagreement as to how
hierarchies may be assembled and as to their validity and repeatability
even within narrowly defined topic areas. The point of view adopted
is critical in determining the rest of the research that ensues. In
the area of structure of mathematical concepts in school children, for
example, recent publications by workers in England (Hart, 1981;
Osborne, 1983), Russia (Krutetskii, 1976) and Finland (Keranto, 1981)
appear to share very little in common. Despite this, work has
progressed towards analyzing tests in selected topics diagnostically.
In the area of diagnostic testing of mathematical abilities, Birenbaum
and Tatsuoka's (1980) contribution is but one of a series from workers
at the University of Illinois; for diagnostic testing of science
concepts several studies can be cited which have proved at least
partially successful (cf. Bartov, 1978; Gorodetsky & Hoz, 1980; Long,
Okey & Yeany, 1978). Johnstone's (1981) review provides an excellent
overview of problems in diagnostic testing in science.

The compress-decompress prototype test of understanding of
science was an attempt to incorporate structural hierarchies relating
to conceptual understanding of selected science topics with a
rule-based algorithm for the construction of each distractor to each
item. A three-level comprehension strategy (factual knowledge,
recognition of principle as well as factual knowledge, and application
as well as recognition of principle) was used to construct a
twenty-item test. Each item's four choices were restricted to a logical
fallacy, an intuition, a faulty content similarity, and the correct
response. (A detailed report of the results of this study is found in
Shaha (1993)).

In the context of the present paper, the important data elements
from this endeavor are three in number: first, the general profile of
responses across correct and incorrect alternatives for related items
at different levels of comprehension; second, the general profile of
responses for those items across the same comprehension level; and
third, the degree of variation of performance of individual examinees
with regard to both related items and like levels. Diagnostic
interpretations can be derived directly from the third "pattern"
listed here: the word "pattern" is taken to refer both to specific
sorts of erroneous responses and to consistent (or inconsistent)
indicators of conceptual level across differing subtopics. In this
test, missing more than one item at any level of comprehension was
almost always matched by a miss of at least two more difficult items
in the same domain. While error patterns did not appear consistent
across the entire test, there were consistent patterns of error,
particularly logical fallacy and intuition errors, within topics.

Developmental testing of children's number concepts among
kindergarten, first and second graders was carried out in a variety of
separate subtopics on a one-to-one basis by trained examiners; this
extended dataset has been kindly supplied by Dr. Jules Zimmer of the
University of California, Santa Barbara. The data consist of four
separate appraisals by the examiner for every target response: the
strategy by which a given number concept problem was solved, the
accuracy of that solution, the ability of the child to draw a version
of what she or he did to handle the problem, and the reconstruction of
the solution from that drawing a week later. Each of four sets of
problems was evaluated in this manner, yielding extensive data which
could be characterized as follows for the majority of cases:

    the accuracy of the response was usually related to the
    strength of the strategy employed by the child.

    the ability to represent the problem was usually related to
    the accuracy of the solution.

    the ability to reconstruct the problem from the
    representation was often related to initial strategy.

The diagnostic portion of the study concerned the question as to
whether patterns of performance by a minority of students were erratic
over the sets of problems. Here the use of the word "pattern" is
taken to mean inconsistent behaviors across differing situations.
Consistent behaviors were taken as an indication of level of
mathematical ability, while inconsistent behaviors were viewed as the
key to how children would face specific trouble spots in their own
understanding of number concepts. Because of the close contact between
students and teacher at this school level, this dataset represents the
one present source for which statistical flags for individuals can be
compared blindly to the informal assessment made by their teachers. In
brief, the diagnostic question was approached by evaluating the extent
of intrasubject agreement across problem sets. Within the subtopics,
those individuals who were substantially inconsistent overall were
flagged and the number of such flags totaled. Seven students were
identified as having a pattern of responses which revealed erratic
performance. These seven, plus two others, were the same students
independently seen by teachers as currently in educational difficulty
or likely to require close attention during the present school year.
Further detail will appear in forthcoming reports.
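
The flagging step described above can be sketched roughly as follows.
This is only an illustration: the data layout, the use of the range of
ratings as the measure of inconsistency, and the threshold value are
all assumptions, since the agreement statistic actually used is not
specified here.

```python
def count_flags(ratings, threshold=2):
    """Total, per student, the subtopics with inconsistent ratings.

    ratings maps student -> subtopic -> list of numeric ratings, one per
    problem set (hypothetical layout). A subtopic is flagged when the
    spread of its ratings reaches the assumed threshold.
    """
    totals = {}
    for student, by_subtopic in ratings.items():
        flags = 0
        for scores in by_subtopic.values():
            if max(scores) - min(scores) >= threshold:
                flags += 1
        totals[student] = flags
    return totals


# Hypothetical ratings on a 0-3 scale over four problem sets.
example = {
    "s01": {"counting": [3, 3, 2, 3], "adding": [2, 2, 2, 2]},
    "s02": {"counting": [3, 0, 2, 1], "adding": [0, 3, 1, 3]},
}
flag_totals = count_flags(example)
```

Students with the largest totals would then be compared with the
teachers' independent judgments.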
2. Advantages and disadvantages of diagnostic testing in education

It is inaccurate to paint too rosy a picture of computerized
diagnostic testing in education at this time. Despite extensive
psychometric research, the primary restrictions revolve around the
relatively coarse grain of measurement in educational testing. That
is, for any single test response or collection of test behaviors in
most areas of education, no responsible party claims to know the
complete underlying cause or causes. In computerized psychological
testing, in contrast, certain test responses, appearing as a set, can
be linked with very high confidence to a narrowly defined, and thus
diagnostically strong, set of likely explanations. Likewise, in many
instances in computerized medical diagnostic testing, once beyond a
critical mass of evidence there are few other plausible outcomes of a
testing algorithm in addition to the one or two primary diagnoses.

The obvious success with which diagnosis takes place in the field
of medicine cannot be matched by comparable successes in most of the
field of education. A variety of interrelated explanations for this
current state of affairs are available, among which are problems of
diagnostic definition, test construction, and practical management.

However, once a certain number of problems are favorably
resolved, it appears that using computers to score and interpret
diagnostic tests in educational settings can accrue the same
advantages as in the current practice of computerized testing in
psychology and medicine. First is the significant accumulation of
hard evidence in the form of a computer databank of diagnostic
indicators. Until computerization, this bank exists mainly as
personal experience. Until computerization, use of logically rigorous
diagnostic procedures is markedly limited by being tied to
paper-and-pencil instruments. Until computerization, adaptive
exploration of possible diagnostic pathways is limited by the patience
and agility of the teacher in bringing various parts of a test
instrument to the examinee at the appropriate moments.
With an appropriate backlog of data, a computer-driven scoring
procedure can efficiently sort results of a test administration
following algorithms regarding hypothesis likelihood. The procedure
can evaluate an extended range of findings cooperatively across
several different tests of the same individual. The procedure can
explore competing alternatives without prejudice, delivering in
conclusion a summary of findings, a statement as to the confidence
level of those findings within the context of the given tests, and
potentially useful avenues for student remediation.

One key problem requiring further research is the problem of
properly encapsulating any respectable cross-section of subject matter
within the highly restricted rules which govern both diagnostic
testing and computerization. That is, even the most flexible
diagnostic strategy, managed by the most intelligent and
"user-friendly" computer programs, is likely to involve severe
trade-offs between optimal measurement characteristics, the available
level of "understanding" of language built into the program, and
practical issues of both test applicability and diagnostic
interpretation. Experience with a promising computer-driven
educational diagnostic algorithm in the Netherlands (Gobits, 1973)
validates these concerns:

    One can expect ... severe difficulties ... when trying to convey
    meaning by a language of very restricted code, i.e., a language
    with severe regulations as to how the form should be. In fact it
    turned out to be practically impossible to shape richer subject
    matter content into the highly regulated forms of the suggested
    'language' ... The moral, of course, is that any 'language' one
    devises for testing and feedback purposes with more restricted
    code than natural language will pose practical problems ... and
    take additional instruction. (R. Gobits, personal
    communication, 1983)

Another problem to be resolved is the closer integration of test
objectives with curricula. This requirement is addressed infrequently
but must be stressed. Even the most elegant statistically based
computer-managed test sequence comes to naught if not tied to the
curriculum. The relative success of diagnostic testing in reading and
simple arithmetic may rest on the extensive acceptance in most school
systems of reading and arithmetic curricula which generally cover the
same explicit goals at the same grade levels even when teaching
methods differ widely. However, many subject domains within American
primary and secondary education, such as the physical and biological
sciences, topics in mathematics beyond elementary algebra, and
computing, are treated uniquely even between neighboring schools in
the same district. With little common ground to stand on, diagnostic
testing may be much more difficult to organize on a broad scale.

However, it is only fair to indicate that many of the concerns
which pertain to educational diagnosis and computerization exist in
the best of efforts involving artificial intelligence to solve
diagnostic problems in medicine. Szolovits and Pauker (1978),
evaluating a series of computerized medical diagnosis programs, list
several important shortcomings:
    1. Programs which deal with relatively broad domains ... have
    inadequate criteria for deciding when a diagnosis is
    complete ... The programs continue exploring less and less sensible
    additional hypotheses ...

    2. Because the initial strategy ... is to use every significant
    new finding ... and because this strategy remains through the
    programs' operation, new hypotheses are continually being
    activated ...

    3. Part of the routine developed by clinicians is an appropriate
    order for acquiring information systematically. Computer
    diagnosticians tend either to enforce such an order too
    strictly ... or not at all ...

    4. The programs rely on a global assessment scheme, but they use
    too weak semantics for the states over which they try to compute
    approximate probabilities ... None of the programs can dynamically
    distinguish among ... aggregate hypotheses ... Yet there are
    therapeutic and strategic decisions which hinge on just such
    distinctions ... (pp. 139-140)

Advances in computerized medical diagnosis since publication of this
important article have attended to, but have not yet entirely resolved,
these concerns.

Diagnostic clarity is lacking in general educational practice,
the areas of reading and speech aside, partially because, unlike
medicine or psychology, the field of education has only occasional
databases which go beyond summary scores by which to examine one or
more normative patterns of skill acquisition. Moreover, the processes
of skill learning even within very restricted areas such as arithmetic
are only beginning to be understood at the same level of detail as,
for example, acquisition of object permanence in infants. In speech
and reading diagnosis, and to some degree in elementary operations in
arithmetic, diagnostic instruments are available which allow efficient
strategy, interpretation, and management. This success stems in part
from a long cumulative history of effort in these areas and in part
from the ability to define very closely the exact skills to be
targeted at each step of the student's development. However, even in
speech, reading, and arithmetic, the field labors under an excessive
number of plausible competing hypotheses, many of which compound one
another. Thus the task of obtaining clear and unambiguous diagnoses
is seldom one which can be completed with a large degree of
confidence.

Test construction has advanced in countless respects during the
last decade, including in particular the mathematical and statistical
developments necessary to support alternative test strategies.
However, to construct an adequate diagnostic test requires an
additional series of considerations: given appropriately specific
definitions, can one write items, for a conventional or
nonconventional diagnostic test, which are jointly corroborating,
exhaustive of the viable alternatives, and parsimonious? The obvious
goal is to obtain reliable items which demonstrate differential
prediction of future performance. Within a test the related items
must be structurally coherent both in respect to item content and type
of response. Yet the same items must also allow the student to give
any significant logically interpretable response whether correct or
erroneous.




From the viewpoint of practice, the management of a diagnostic
test obviously requires more than the usual attention from the
teacher, and more effort to interpret. The student, though, may treat
the experience in much the same manner as any other test, including
obtaining correct answers by erroneous methods and accidental guessing
of correct answers (both of which make educational diagnosis
especially difficult). The student may simply be sloppy in responding
but the diagnostic protocol will attempt to treat every answer, right
or wrong, as equally legitimate.
List of Figures

Figure 1   Items Illustrating Distractors Generated
           According to a Plausibility Criterion

Figure 2   Items Constructed According to a Logical
           Error Analysis

Figure 3   Items Illustrating Logical Error Analysis
           Distractors But on Unrelated Content

Figure 4   Example of a Theory-Based Item Structure



1. Test construction strategies 

Because diagnostic testing depends critically on the strength of
the test items, strategies for the development of the strongest
possible items are essential. Four different strategies for writing
multiple choice items and their distractors are considered below.

Writers on multiple choice testing exhibit significant
disagreement about the role played by distractors, the incorrect
alternative responses. In part this stems from the different uses to
which test scores are put. In a criterion-referenced test an
incorrect response directly conveys a piece of information about the
individual's achievement, but in a normative test it serves only as an
aid towards ranking students on a total scale. However, some of the
divergence results from conflicting views of the strategies adopted by
a student to answer a multiple choice item. Thus this paper continues
with a consideration of two analytic approaches to analyzing patterns
of erroneous responses: contingency analysis, in which attraction to
similar distractors is tested categorically, and probabilistic
evaluation, in which distractor attraction is tested by a process of
probabilistic differentiation among competing hypotheses.
Strategy 1. The Plausibility Criterion
Using the plausibility strategy, item writers construct statements
that will appear reasonable to an uninformed person, but which would
be judged clearly incorrect by an expert. Two items constructed in
this way from a Test of the Understanding of Science are presented in
Figure 1. As a rule the correct response is written first, and then
distractor statements are constructed to match it as nearly as
possible in terms of length and linguistic complexity. Further, the
distractors should appear sufficiently plausible to individuals with
low achievement, so that a substantial proportion of examinees would
be inclined to choose one of them rather than the correct answer.
Estimates made by item writers of the plausibility of particular
distractors are prone to error, and it is rarely possible to get a
reliable estimate of a particular distractor's drawing power without
field testing the item in its complete form. A general guideline for
test constructors who work in this fashion is that a distractor that
attracts fewer than 10 percent of the erroneous responses is not doing
its job adequately and should be replaced by a more plausible
statement.
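
The 10 percent guideline can be checked mechanically once field-test
responses are in hand. The following is a minimal sketch; the function
name, the single-letter response coding and the sample data are
illustrative assumptions.

```python
from collections import Counter


def weak_distractors(responses, key, options, cutoff=0.10):
    """Return distractors drawing less than `cutoff` of the erroneous responses."""
    # Keep only recognized wrong answers; omits and the keyed answer are ignored.
    errors = [r for r in responses if r != key and r in options]
    if not errors:
        return []
    counts = Counter(errors)
    return [opt for opt in options
            if opt != key and counts[opt] / len(errors) < cutoff]


# Hypothetical field-test data for one five-choice item keyed "A".
field_responses = ["A"] * 30 + ["B"] * 10 + ["C"] * 8 + ["D"] + ["E"]
to_replace = weak_distractors(field_responses, "A", ["A", "B", "C", "D", "E"])
```

Here the proportion is taken over erroneous responses only, following
the wording of the guideline, rather than over all examinees.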

Figure 1: Items illustrating distractors generated according to a
plausibility criterion.

6. We do experiments when we are learning science because:

   A. Experiments are used to test ideas by experience.

   B. Experiments enable us to learn better.

   C. Experiments make learning more interesting.

   D. We can show that we all get the same results.

   E. It is important to learn to handle apparatus
      skillfully.

7. Why should one make a written note of all the observations made
   when carrying out a scientific investigation?

   A. One might forget them, and they may turn out to be
      important later.

   B. It is a good way to train powers of observation.

   C. It trains one to think clearly and write accurately.

   D. Good scientists always do it.

   E. One is supposed to have a complete record of what has
      been done.

Source: IEA Test of Understanding Science

This method of constructing multiple choice items, though very
widespread, usually is of limited interest in diagnostic testing
because the choice of a particular distractor seldom gives clear
information about the learning problems of the individual testee.
Strategy 2. The Use of Most Frequent Errors
The "most frequent errors" strategy, in its simplest mode,
consists of giving test items in an open-ended format to samples of
individuals at an appropriate level in order to determine the three or
four erroneous responses that are given with the highest frequency.
More pragmatically based than Strategy 1, it produces distractors that
are plausible from the students' point of view, but it suffers from a
major drawback in that many of the most frequent responses produced by
students will be almost correct. Thus high ability students and
experts may not be able to discriminate between correct and incorrect
responses with a high rate of consistency, and overall test
reliability may be low. If the student-generated distractors are
modified by the test constructor to make them clearly incorrect then
their built-in plausibility may disappear. Once again, note that
distractors generated by this method are rarely intended to carry
diagnostic information.
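
In its simplest mode the strategy amounts to a frequency tally over
open-ended responses. A minimal sketch follows; the function name and
the sample responses are hypothetical, and responses are assumed to
have been normalized to comparable strings beforehand.

```python
from collections import Counter


def frequent_errors(open_responses, correct, n=4):
    """Return the n most frequent erroneous open-ended responses."""
    cleaned = (r.strip() for r in open_responses)
    wrong = Counter(r for r in cleaned if r != correct)
    return [response for response, _count in wrong.most_common(n)]


# Hypothetical open-ended answers to one item whose correct answer is "27".
answers = ["27", "33", "33", "37", "33", "37", "79", "27"]
candidate_distractors = frequent_errors(answers, "27", n=2)
```

The tally does not address the drawback noted above: nothing in it
distinguishes an almost-correct response from a clearly wrong one.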

Strategy 3. Logical Error Analysis

Items can be designed such that the distractors reveal specific
errors of logic and procedure. Figure 2 contains items from an
arithmetic test in which the distractors have been designed to be
chosen by students who make particular procedural errors. A student
who transfers incorrectly between the tens and units column in item 9
might be expected to pick response (E). In a test in which all the
items concentrate on a narrow domain of skills, such as integer
addition or subtraction, it may be possible to infer certain
diagnostic conditions from the pattern of responses to the whole
test. However, many of the multiple-choice tests that use this
approach use quite different techniques for generating the distractors
to different items, depending upon the content of each item
concerned. Figure 3 gives an example of two such items from a science
achievement test. Here the diagnostic information revealed by a
single incorrect response is too unreliable (see Tatsuoka, Birenbaum,
Tatsuoka, and Baillie, 1981) to be interpreted, and since the evidence
provided by different items bears on different issues, aggregation of
the diagnostic information from the items is difficult.



Figure 2: Items constructed according to a logical error analysis.

 9.   53        A. 33
      26        B. 37
                C. 27
                D. 79
                E. 47

10.   44        A. 32
      16        B. 38
                C. 48
                D. 28
                E. 60
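
Rule-based distractors of this kind can be generated by simulating
each procedural error. The sketch below assumes that the Figure 2
items are two-digit subtractions requiring borrowing, which the option
values are consistent with but the figure does not state; the rule
names are illustrative labels, not the test authors' terminology.

```python
def subtraction_options(a, b):
    """Correct answer and rule-based distractors for two-digit a - b.

    Assumes the units digit of a is smaller than that of b, so that
    borrowing is needed (an assumption of this sketch).
    """
    a_tens, a_units = divmod(a, 10)
    b_tens, b_units = divmod(b, 10)
    units_after_borrow = a_units + 10 - b_units
    return {
        "correct": a - b,
        # Subtract the smaller digit from the larger in each column.
        "smaller_from_larger": 10 * abs(a_tens - b_tens) + abs(a_units - b_units),
        # Borrow for the units column but never reduce the tens digit.
        "borrow_not_deducted": 10 * (a_tens - b_tens) + units_after_borrow,
        # Borrow for the units column but add the ten to the tens digit.
        "borrow_added_to_tens": 10 * (a_tens + 1 - b_tens) + units_after_borrow,
        # Add the two numbers instead of subtracting.
        "added_instead": a + b,
    }


item_9 = subtraction_options(53, 26)
item_10 = subtraction_options(44, 16)
```

Under these assumed rules the function reproduces the five option
values printed for each item in Figure 2, with the mishandled transfer
between columns yielding 47 for item 9.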
Figure 3: Items illustrating logical error analysis for
distractors but on unrelated content.

12. Flour is a fine powder obtained by grinding wheat or other
    cereal grains. A pile of grain burns only very slowly
    whereas flour dust suspended in air is explosive. Which
    of the following is the best explanation of this?

    A. The heat produced when small particles burn is greater
       than the heat produced by the burning of large
       particles of the same substance.

    B. Grinding the grain changes its chemical composition.

    C. For the same quantity of the material, small particles
       have a greater surface area in contact with air than
       large particles.

    D. Small particles possess more energy than large
       particles.

    E. The flour burns completely whereas the pile of grain
       does not.

13. Two given elements combine to form a poisonous compound.
    Which of the following conclusions about the properties of
    these two elements can be drawn from this information?

    A. Both elements are certainly poisonous.
    B. At least one element is certainly poisonous.
    C. One element is poisonous, the other is not.
    D. Neither element is poisonous.
    E. Neither element need be poisonous.

Source: IEA Science Test 4B



Strategy 4. Theory Based Distractors

Items can be defined following a theory of the consistent parts
of erroneous understanding; this strategy for item writing is
comparatively rare. In a typical example, the test constructor
attempts to use a theory of student cognitive behavior, a logical
analysis of the subject area, or a personality theory in order to
define a discrete number of response types, and to write distractors
for each item that fall into one of these types. A good example of
such a test is the Cognitive Preference Style in Science test
developed by Kempa (Tamir & Kempa, 1978); a sample item from this test
is given in Figure 4. Four styles of cognitive preference between
which the test is designed to discriminate are recall, principles,
questioning, and application. The item shown has no incorrect
answers; instead it is hypothesised that a student whose preference is
for the recall style would be most likely to select response (A),
whereas a student whose preference is for application would tend to
select option (B), etc. Such tests are typically used not for routine
assessment or diagnosis but for research, and in many cases the
evidence for their theoretical validity is not strong. However,
initial successes in using strict theoretical frameworks to construct
such instruments suggest that it is possible to apply a more
structured approach to the design of distractors for regular
diagnostic instruments.
Figure 4: Example of a theory-based item structure.

A gas spreads out to fill the volume of the containing vessel.
  (A) Gas particles are in a state of motion.
  (B) The movement of the gas molecules enables us to
      experience smells at a distance from their origin.

  (C) The speed of movement depends on the mass of the gas
      molecules.

  (D) The gas molecules are in a state of perpetual motion
      because they possess kinetic energy.

Source: Tamir and Kempa, 1978.



2. Contingency Analysis 

Multiple-choice achievement test data characteristically show
a fair amount of inconsistency: even the most able students sometimes
select incorrect responses, for reasons that are often unclear. Less
able students sometimes select correct responses to difficult
problems, again for reasons that are often unclear, probably but not
necessarily guessing at random even when one might hypothesize that
their level of understanding would lead them to choose one particular
distractor.

Tests must be composed of many items if reliability and precision
are to be achieved; however, one major goal of diagnostic testing
is to form reliable diagnostic judgments from the pattern of results
of a parsimonious set of items. If meaningful diagnostic
interpretations are to be predicated on the choice of a particular
distractor, then the item and its distractors must themselves be
strong enough to sustain it; within a reasonable probability the
selection of a particular distractor must reflect the examinee's state
of learning and/or mislearning in the domain.

A straightforward approach to the investigation of this issue is
through contingency analysis. If items are functioning as expected
diagnostically, we hypothesize that an examinee who responds with
error "A" to one item will also respond with error "A" to related
items. It is appropriate to require, as evidence of the diagnostic
validity of paired distractors, that a significantly larger number of
like-error contingencies occur than would occur if responses were
random.

For each item pair a table of frequencies of all possible
response pairs is evaluated by a simple χ² test to show whether the
pattern is random. Inclusion of a 'correct' answer to both items is
sufficient to render the χ² value significant. An improved test
calculates χ² for the response grid after eliminating the row and
column corresponding to 'correct' answers. If significant, a check
can be made to determine if the predicted patterns are those that
occur with unusual frequency.
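
The improved test can be sketched in a few lines. The code below
computes the Pearson χ² statistic and its degrees of freedom for a
response-pair frequency table after the 'correct' row and column have
been removed. The function names and the example table are
hypothetical, and the statistic would still have to be compared with a
tabled critical value.

```python
def drop_correct(table, key_row, key_col):
    """Remove the row and column that correspond to the 'correct' answers."""
    return [[v for j, v in enumerate(row) if j != key_col]
            for i, row in enumerate(table) if i != key_row]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    # Drop empty rows and columns so that every expected count is positive.
    rows = [r for r in table if sum(r) > 0]
    if not rows:
        return 0.0, 0
    keep = [j for j in range(len(rows[0])) if sum(r[j] for r in rows) > 0]
    rows = [[r[j] for j in keep] for r in rows]
    if len(rows) < 2 or len(keep) < 2:
        return 0.0, 0
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in zip(*rows)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(keep) - 1)


# Hypothetical frequencies: rows are responses to one item, columns are
# responses to a related item; index 0 is the keyed answer on both.
pair_table = [
    [40, 3, 2],
    [4, 10, 1],
    [3, 1, 10],
]
errors_only = drop_correct(pair_table, 0, 0)
statistic, dof = chi_square(errors_only)
```

For two five-choice items the reduced grid is 4 by 4, giving 9 degrees
of freedom, for which the tabled critical value at the .05 level is
16.92.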

3. Probabilistic diagnostic evaluation

a.) Theory of probabilistic differentiation

Many of the successes of diagnosis in medical practice can be
attributed to use of probabilistic rather than strictly categorical
evaluation of the available evidence at any given time.
Diagnostic testing in education can be regarded as a process of
probabilistic differentiation between alternative hypotheses, one of
which is that the student has actually mastered the material. By
constructing a test which combines a plausible set of hypotheses based
on common reasons for failure with a probabilistic scheme for
evaluation (Box & Tiao, 1973; De Groot, 1970; Novick, 1970; Novick,
Jackson, Thayer & Cole, 1971), the diagnostic process can be made
efficient.

/ 

One question which can be addressed from several angles is the
question of optimal stopping: how much evidence can be regarded as
sufficient? The mathematical concept of "martingale" allows one
vector of information (read: student's responses to test items) to be
associated with another vector of information (read: diagnostic
utilities of item responses). At some determinable point, the
expected information gain can be calculated precisely if the elements
of both vectors are known (De Groot, 1970, p. 356 ff). In the present
instance, the elements in the latter vector can be represented
probabilistically.

Each diagnostic hypothesis carries a certain a priori probability
of being true, which varies from one hypothesis to another. An
efficient diagnostic testing process accumulates evidence to help
differentiate between hypotheses by reevaluating the belief
probabilities of each of the diagnostic hypotheses after each
response. At some discrete point in the process the probability of
one of the working hypotheses will become sufficiently high to justify
a report of it as a probable diagnosis. Bayes' theorem is used to
aggregate the evidence.

The following notation allows demonstration of diagnostic
assessment of mastery within a specified curricular subdomain.

(a) Diagnostic hypotheses

$H_0$: The student has mastered the subdomain.

$H_1, \dots, H_5$: The student has not mastered the subdomain
due to one of five specified learning gaps,
misconceptions or misunderstandings.






(b) Probabilities

$P_i$: the probability that hypothesis $H_i$ is correct
for a particular student, given that one and only
one of [$H_0 \dots H_5$] is correct.

$P_i^{(k)}$ is the probability after $k$ items have been attempted.

(c) Distractors and flags

Suppose that each item has four choices and the "event" that
the subject chooses the first one on item $k$ is coded as
$X_k = [1,0,0,0]$; $x_{jk} = 1$ if alternative $j$ on item $k$ is
selected, and $= 0$ otherwise.
Each alternative is flagged to one and only one of the
hypotheses; $d_{jk} = 1$ if the $j$th alternative on item $k$ is flagged to
hypothesis $i$,



so

The prior probabilities specify that in the absence of any
specific evidence, each diagnostic hypothesis and the mastery
condition are equally likely. Until empirical studies establish the
mathematical basis for a more sophisticated Bayesian model, the
following will be used:

$P_i^{(0)} = 1/n = 0.167$ for $i = 0 \dots 5$ (with $n = 6$ hypotheses)
The conditional probabilities indicate the strength of the
diagnostic information provided by a single test item. Although
empirical studies will be necessary to establish this characteristic
for any particular type of item, past experience suggests that the
probability of selecting the response flagged to a particular and true
hypothesis will be somewhere in the range 0.4 to 0.8. A starting
value of 0.55 would thus seem to be fairly conservative.



$$
\mathrm{Prob}[X_k \mid H_i] =
\begin{cases}
0.55 & \text{if the chosen alternative is flagged to hypothesis } i
       \text{ (i.e., if } x_{jk} = 1 \text{ and } d_{jk} = 1 \text{ for some } j\text{)} \\
0.15 & \text{if one of the rejected alternatives is flagged to hypothesis } i
       \text{ (i.e., if } x_{jk} = 0 \text{ and } d_{jk} = 1 \text{ for some } j\text{)} \\
0.25 & \text{if, for this item, none of the distractors are flagged for hypothesis } i
       \text{ (i.e., if } d_{jk} = 0 \text{ for all } j\text{)}
\end{cases}
$$

Bayes' theorem, now expressed as

$$
P_i^{(k)} = \frac{P_i^{(k-1)} \cdot \mathrm{Prob}[X_k \mid H_i]}
                 {\sum_{j=0}^{5} P_j^{(k-1)} \cdot \mathrm{Prob}[X_k \mid H_j]}
$$

allows an algorithm to be established for the sequencing of test
materials and, through such algorithm, the basis for forming a decision
as to whether to continue testing. Within a subdomain:

for the first item:   Select at random from the full set.

for the second item:  (a) Identify all hypotheses not
                          covered in the preceding item.

                      (b) Select at random from items that
                          include all hypotheses so identified.

for the third item:   (a) Identify all hypotheses not covered
                          twice in the preceding items.

                      (b) Select an item which covers as many of
                          these as possible.

for the fourth item:  (a) Identify all hypotheses not covered at
                          least twice in the first three items.

                      (b) Identify the hypothesis with the
                          greatest $P_i$ value.

                      (c) Select an item to cover hypotheses
                          identified in (a) and (b) above.

Discontinue testing when $P_i$ reaches a confidence level of at
least 0.8 for some $i$. Hypothesis $H_i$ is then reported.

These rules are concise and straightforward. If the student
responds consistently, they should lead rapidly to the identification
of the appropriate diagnosis.
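
One possible reading of the sequencing rules is sketched below in Python. Several points are assumptions rather than part of the rules as stated: the item bank is illustrative, the "as many as possible" criterion is applied at every step (it coincides with the second-item rule whenever some item covers all the hypotheses identified), ties are broken at random, and the fourth-item rule is reused for any later item.

```python
import random

# Illustrative bank: item number -> hypothesis index flagged to each
# alternative (0 = mastery, i.e. the correct answer).
BANK = {
    1: {"A": 0, "B": 1, "C": 2, "D": 3},
    2: {"A": 1, "B": 0, "C": 3, "D": 5},
    3: {"A": 4, "B": 5, "C": 0, "D": 1},
    4: {"A": 1, "B": 2, "C": 3, "D": 0},
    5: {"A": 0, "B": 3, "C": 2, "D": 1},
    6: {"A": 2, "B": 0, "C": 4, "D": 5},
}
N_HYPOTHESES = 6


def targets(administered, probabilities):
    """Hypotheses that the next item should try to cover."""
    if not administered:
        return set()  # first item: no constraint, choose at random
    counts = [0] * N_HYPOTHESES
    for item in administered:
        for hypothesis in BANK[item].values():
            counts[hypothesis] += 1
    # Second item: anything not yet covered; later items: not covered twice.
    needed = 1 if len(administered) == 1 else 2
    wanted = {h for h, c in enumerate(counts) if c < needed}
    if len(administered) >= 3:
        # Fourth item onward: also cover the currently leading hypothesis.
        wanted.add(max(range(N_HYPOTHESES), key=lambda h: probabilities[h]))
    return wanted


def next_item(administered, probabilities, rng=random):
    """Pick an unused item covering as many target hypotheses as possible."""
    remaining = [item for item in BANK if item not in administered]
    if not remaining:
        raise ValueError("item bank exhausted")
    wanted = targets(administered, probabilities)
    coverage = {item: len(wanted & set(BANK[item].values())) for item in remaining}
    best = max(coverage.values())
    return rng.choice([item for item in remaining if coverage[item] == best])


uniform = [1 / N_HYPOTHESES] * N_HYPOTHESES
print(sorted(targets([1], uniform)))   # hypotheses item 1 leaves uncovered
print(next_item([1, 3], uniform))      # third item after items 1 and 3
```

The stopping rule is deliberately left out of this sketch; it depends only on the current probabilities and can be applied after each response.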

b.) Illustration of probabilistic differentiation

Consider a sequence of six multiple choice items whose responses
have been flagged to indicate the relevant hypothesis according to the
following table. (H0 denotes the correct answer in each case.)

                          Item

Response      1      2      3      4      5      6

   A          H0     H1     H4     H1     H0     H2
   B          H1     H0     H5     H2     H3     H0
   C          H2     H3     H0     H3     H2     H4
   D          H3     H5     H1     H0     H1     H5

The prior probabilities of each hypothesis are set equal to 0.167.
Response A to item 1 provides some evidence of mastery of the
subdomain by the examinee. Bayes' theorem combines this evidence with
the prior probabilities to give a new probability of mastery:

$$
P_0^{(1)} = \frac{0.167 \times 0.55}{0.167 \times 0.55 + 0.167 \times 0.15 + \dots} = 0.367
$$
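
The single update above can be reproduced with a short Python sketch. The function and variable names are our own; the three conditional probabilities are the starting values given earlier. Note that for an item whose four alternatives are flagged to four different hypotheses these values form a proper distribution over the alternatives: 0.55 + 3 x 0.15 = 1 under a flagged hypothesis and 4 x 0.25 = 1 under an unflagged one.

```python
# Starting values for Prob[X_k | H_i] as given in the text.
P_CHOSEN, P_REJECTED, P_UNFLAGGED = 0.55, 0.15, 0.25


def likelihood(item_flags, chosen, hypothesis):
    """Prob[X_k | H_i] for one response to one item."""
    if item_flags[chosen] == hypothesis:
        return P_CHOSEN        # chosen alternative is flagged to H_i
    if hypothesis in item_flags.values():
        return P_REJECTED      # a rejected alternative is flagged to H_i
    return P_UNFLAGGED         # no alternative on this item is flagged to H_i


def update(prior, item_flags, chosen):
    """Apply Bayes' theorem once across all hypotheses."""
    weighted = [p * likelihood(item_flags, chosen, i) for i, p in enumerate(prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Item 1 of the illustration: alternative -> index of the flagged hypothesis.
item_1 = {"A": 0, "B": 1, "C": 2, "D": 3}
prior = [1 / 6] * 6

posterior = update(prior, item_1, "A")
print([round(p, 3) for p in posterior])
```

The first entry of the result is the probability of mastery, 0.367 to three places; the hypotheses flagged to rejected alternatives fall to 0.100 and the two unflagged hypotheses stay at 0.167.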



The probabilities of the other hypotheses are similarly
recalculated, and recorded in a table which gives the probabilities of
the various hypotheses and the responses selected on successive
items. If the subject's response to item 2 was B and to item 3 was C,
the following table results.

                         Calculated probabilities

                          P2      P3      P4      P5

Prior values             .167    .167    .167    .167
Item 1 - Response A      .100    .100    .167    .167
Item 2 - Response B      .076    .046    .128    .077
Item 3 - Response C      .046    .027    .046    .028

After three items testing could be discontinued since the probability
of hypothesis $H_0$ (= mastery) has risen over 0.8.

Next consider a subject who gives a somewhat less consistent
pattern of responses. He chooses the alternative flagged for $H_1$
except on item 2 where his response is flagged for $H_5$.

                         Calculated probabilities

                          P0      P1      P2      P3      P4      P5

Prior values             .167    .167    .167    .167    .167    .167
Item 1 - Response B      .100    .367    .100    .100    .167    .167
Item 2 - Response D      .061    .226    .102    .061    .171    .377
Item 3 - Response D      .035    .484    .100    .059    .100    .220
Item 4 - Response A      .013    .709    .039    .023    .066    .146
Item 5 - Response D      .003    .858*   .012    .007    .036    .080

In this case we discontinue testing after five items and report $H_1$.
Finally, consider a subject who chooses the response appropriate to $H_5$
when one is available, but guesses when one is not.

                         Calculated probabilities

                          P0      P1      P2      P3      P4      P5

Prior values             .167    .167    .167    .167    .167    .167
Item 1 - Response B      .100    .367    .100    .100    .167    .167
Item 2 - Response D      .061    .226    .102    .061    .171    .377
Item 3 - Response B      .028    .107    .080    .048    .080    .654
Item 4 - Response D      .065    .068    .051    .030    .085    .698
Item 5 - Response B      .040    .042    .031    .068    .088    .727
Item 6 - Response D      .013    .023    .010    .037    .029    .886*

In this case, all six items are needed before a hypothesis reaches the
specified confidence level.
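
The three cases can be rerun with the following Python sketch, which applies the update after each response and stops at the 0.8 criterion. The names are our own and the items are taken in the fixed order 1 to 6. Because the sketch carries full precision from step to step, its output agrees with the tabulated values only to within a few thousandths; for the first subject it gives a probability of mastery of roughly 0.83 after three items.

```python
# Flags from the illustration: item -> alternative -> hypothesis index.
FLAGS = {
    1: {"A": 0, "B": 1, "C": 2, "D": 3},
    2: {"A": 1, "B": 0, "C": 3, "D": 5},
    3: {"A": 4, "B": 5, "C": 0, "D": 1},
    4: {"A": 1, "B": 2, "C": 3, "D": 0},
    5: {"A": 0, "B": 3, "C": 2, "D": 1},
    6: {"A": 2, "B": 0, "C": 4, "D": 5},
}
P_CHOSEN, P_REJECTED, P_UNFLAGGED = 0.55, 0.15, 0.25
THRESHOLD = 0.8


def likelihood(item, chosen, hypothesis):
    """Prob[X_k | H_i] using the three starting values from the text."""
    flags = FLAGS[item]
    if flags[chosen] == hypothesis:
        return P_CHOSEN
    return P_REJECTED if hypothesis in flags.values() else P_UNFLAGGED


def update(probabilities, item, chosen):
    """One Bayes update of all six hypothesis probabilities."""
    weighted = [p * likelihood(item, chosen, i) for i, p in enumerate(probabilities)]
    total = sum(weighted)
    return [w / total for w in weighted]


def run(responses, order=(1, 2, 3, 4, 5, 6)):
    """Return (items used, reported hypothesis or None, probability history)."""
    probabilities = [1 / 6] * 6
    history = [probabilities]
    for count, (item, chosen) in enumerate(zip(order, responses), start=1):
        probabilities = update(probabilities, item, chosen)
        history.append(probabilities)
        leader = max(range(6), key=lambda h: probabilities[h])
        if probabilities[leader] >= THRESHOLD:
            return count, leader, history
    return len(history) - 1, None, history  # criterion never reached


subjects = {"first": "ABC", "second": "BDDAD", "third": "BDBDBD"}
for name, responses in subjects.items():
    used, reported, history = run(responses)
    final = " ".join(f"{p:.3f}" for p in history[-1])
    print(f"{name:6s} stops after {used} items, reports H{reported}: {final}")
```

With these response patterns the sketch stops after three, five and six items and reports mastery, the first hypothesis and the fifth hypothesis respectively, matching the three cases above.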

Note that in each of the three cases above, the subject responded
with a fair degree of consistency. Subjects who respond
inconsistently will need to be given more items before any hypothesis
is established. In practice, if this cannot be done by ten items
then it may be best to report this fact and move on to another area.

Note also that in the above examples, the six items were
attempted in a fixed order, in general not the most efficient
procedure. For example, after the third subject had attempted three
items, $H_5$ was clearly established as the most probable hypothesis. It
would have been better to then administer item 6 (which relates to $H_5$)
rather than items 4 and 5 (which do not). If this had been done (and
response D was still selected) testing could have been terminated
immediately.
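
This last point can be checked numerically. The Python sketch below starts from the third subject's tabulated probabilities after three items, looks for remaining items that carry an alternative flagged to the leading hypothesis, and applies one further update. The names are our own, and the selection step is a simplified stand-in for the fourth-item rule, not a full implementation of it.

```python
# Flags for the three items not yet administered (from the illustration).
FLAGS = {
    4: {"A": 1, "B": 2, "C": 3, "D": 0},
    5: {"A": 0, "B": 3, "C": 2, "D": 1},
    6: {"A": 2, "B": 0, "C": 4, "D": 5},
}
P_CHOSEN, P_REJECTED, P_UNFLAGGED = 0.55, 0.15, 0.25


def likelihood(item, chosen, hypothesis):
    """Prob[X_k | H_i] using the three starting values from the text."""
    flags = FLAGS[item]
    if flags[chosen] == hypothesis:
        return P_CHOSEN
    return P_REJECTED if hypothesis in flags.values() else P_UNFLAGGED


def update(probabilities, item, chosen):
    """One Bayes update; normalisation absorbs rounding in the inputs."""
    weighted = [p * likelihood(item, chosen, i) for i, p in enumerate(probabilities)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Third subject, tabulated values after items 1-3 (P0 ... P5).
after_three = [0.028, 0.107, 0.080, 0.048, 0.080, 0.654]
leader = max(range(6), key=lambda h: after_three[h])

# Remaining items that offer an alternative flagged to the leading hypothesis.
relevant = [item for item in FLAGS if leader in FLAGS[item].values()]
item = relevant[0]
choice = next(alt for alt, h in FLAGS[item].items() if h == leader)

after_next = update(after_three, item, choice)
print(f"leading hypothesis H{leader}; relevant items {relevant}")
print(f"after item {item}, response {choice}: P{leader} = {after_next[leader]:.3f}")
```

Only item 6 is relevant, and a response of D raises the probability of the fifth hypothesis to about 0.84, above the 0.8 criterion after four items instead of six.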